Abstract

A large spectrum of applications such as location based services and environmental monitoring demand efficient query
processing on uncertain databases. In this paper, we propose the probabilistic Voronoi diagram (PVD) for processing
moving nearest neighbor queries on uncertain data, namely the probabilistic moving nearest neighbor (PMNN)
queries. A PMNN query finds the most probable nearest neighbor of a moving query point continuously. To process
PMNN queries efficiently, we provide two techniques: a pre-computation approach and an incremental approach.
In the pre-computation approach, we develop an algorithm to efficiently evaluate PMNN queries based on the
pre-computed PVD for the entire data set. In the incremental approach, we propose an incremental probabilistic safe
region based technique that does not require pre-computing the whole PVD to answer the PMNN query. In this
incremental approach, we exploit the knowledge of a known region to compute the lower bound of the probability of an
object being the nearest neighbor. Experimental results show that our approaches outperform a sampling
based approach by orders of magnitude in terms of I/O, query processing time, and communication overheads.

Keywords: Voronoi diagrams, continuous queries, moving objects, uncertain data 



1. Introduction 

Uncertainty is an inherent property in many database applications that include location based services [1],
environmental monitoring [2], and feature extraction systems [3]. The inaccuracy or imprecision of data capturing devices,
the privacy concerns of users, and the limitations on bandwidth and battery power introduce uncertainties in different 
attributes such as the location of an object or the measured value of a sensor. The values of these attributes are stored 
in a database, known as an uncertain database. 

In recent years, query processing on an uncertain database has received significant attention from the research 
community due to its wide range of applications. Consider a location based application where the location information 
of users may need to be pre-processed before publishing due to the privacy concerns of users. Alternatively, a user may
want to provide her position as a larger region in order to prevent her location from being pinpointed to a particular site.
In such cases, locations of users are stored as uncertain attributes such as regions instead of points in the database.
An application that deals with the location of objects (e.g., post office, hospital) obtained from satellite images is
another example of an uncertain database. Since locations cannot always be identified accurately
from the satellite images due to noisy transmission, locations of objects need to be represented as regions denoting
the probable locations of objects. Likewise, in a biological database, objects identified from microscopic images need
to be represented as uncertain attributes due to inaccuracies of data capturing devices.

In this paper, we propose a novel concept called the Probabilistic Voronoi Diagram (PVD), which has the potential
to efficiently process nearest neighbor (NN) queries on an uncertain database. The PVD for a given set of uncertain
objects $o_1, o_2, \ldots, o_n$ partitions the data space into a set of Probabilistic Voronoi Cells (PVCs) based on the probability
measure. Each cell $PVC(o_i)$ is a region in the data space such that, for each point in this region, $o_i$ has a higher probability
of being the NN than any other object.

A nearest neighbor (NN) query on an uncertain database, called a Probabilistic Nearest Neighbor (PNN) query, 
returns a set of objects, where each object has a non-zero probability of being the nearest to a query point. A common 
variant of the PNN query that finds the most probable NN to a given query point is also called a top-1-PNN query.
Existing research focuses on efficient processing of PNN queries [4][5][6][7] and their variants [8][9][10] for a static
query point. In this paper, we are interested in answering Probabilistic Moving Nearest Neighbor (PMNN) queries on 
an uncertain database, where data objects are static, the query is moving, and the future path of the moving query is 
unknown. A PMNN query returns the most probable nearest object for a moving query point continuously. 

A straightforward approach for evaluating a PMNN query is to use a sampling-based method, which processes the 
PMNN query as a sequence of PNN queries at sampled locations on the query path. However, to obtain up-to-date 
answers, a high sampling rate is required, which makes the sampling-based approach inefficient due to the frequent 
processing of PNN queries. 

To avoid the high processing cost of the sampling based approach and to provide continuous results, recent approaches
for continuous NN query processing on a point data set rely on safe-region based techniques, e.g., the Voronoi diagram [11].
In a Voronoi diagram based approach, the data space is partitioned into disjoint Voronoi cells where all
points inside a cell have the same NN. Then, the NN of a query point is reduced to identifying the cell for the query 
point, and the result of a moving query point remains valid as long as it remains inside that cell. Motivated by the 
safe-region based paradigm, in this paper we propose a Voronoi diagram based approach for processing a PMNN 
query on a set of uncertain objects. 

Voronoi diagrams for uncertain objects [6][12] based on a simple distance metric, such as the minimum and
maximum distances to objects, result in a large neutral region that contains those points for which no specific NN 
object is defined. Thus, these are not suitable for processing a PMNN query. In this paper, we propose the PVD that 
divides the space based on a probability measure rather than using just a simple distance metric. 

A naive approach to compute the PVD is to find the top-1-PNN for every possible location in the data space using
existing static PNN query processing techniques [4][5][7], which is impractical due to its high computational
overhead. In this paper, we propose a practical solution to compute the PVD for a set of uncertain objects. The key
idea of our approach is to efficiently compute the probabilistic bisectors between neighboring objects, which form
the basis of PVCs for the PVD.

After computing the PVD, the most probable NN can be determined by simply identifying the PVC in which the 
query point is currently located. The result of the query does not change as long as the moving query point remains 
in the current PVC. A user sends its request as soon as it exits the PVC. Thus, in contrast to the sampling based 
approach, the PVD ensures the most probable NN for every point of a moving query path is available. Since this 
approach requires the pre-computation of the whole PVD, we name it the pre-computation approach in this paper. 

The pre-computation approach needs to access all the objects from the database to compute the entire PVD. In 
addition, the PVD needs to be re-computed for any updates (insertion or deletion) to the database. Thus the
pre-computation approach may not be suitable when the query is confined to a small region of the data
space or when there are frequent updates in the database. For such cases, we propose an incremental algorithm based
on the concept of local PVD. In this approach, a set of surrounding objects and an associated search space, called 
known region, with respect to the current query position are retrieved from the database. Objects are retrieved based 
on their probabilistic NN rankings from the current query location. Then, we compute the local PVD based only on the 
retrieved data set, and develop a probabilistic safe region based PMNN query processing technique. The probabilistic 
safe region defines a region for an uncertain object where the object is guaranteed to be the most probable nearest 
neighbor. This probabilistic safe region enables a user to utilize the retrieved data more efficiently and reduces the 
communication overheads when a client is connected to the server through a wireless link. The process needs to be 
repeated as soon as the retrieved data set cannot provide the required answer for the moving query point. We name 
this PMNN query processing technique the incremental approach in this paper. 

In summary, we make the following contributions in this paper: 

• We formulate the Probabilistic Voronoi Diagram (PVD) for uncertain objects and propose techniques to
compute the PVD.

• We provide an algorithm for evaluating PMNN queries based on the pre-computed PVD.

• We propose an incremental algorithm for evaluating PMNN queries based on the concept of local PVD. 

• We conduct an extensive experimental study which shows that our PVD based approaches outperform the 
sampling based approach significantly. 

The rest of the paper is organized as follows. Section 2 discusses preliminaries and the problem setup. Section 3
reviews related work. In Section 4, we formulate the concept of PVD and present methods to compute it, focusing on
one and two dimensional spaces. In Section 5, we present two techniques, a pre-computation approach and an incremental
approach, for processing PMNN queries. Section 6 reports our experimental results and Section 7 concludes the paper.



2. Preliminaries and Problem Setup 

Let $O$ be a set of uncertain objects in a $d$-dimensional data space. An uncertain object $o_i \in O$, $1 \le i \le |O|$,
is represented by a $d$-dimensional uncertain range $R_i$ and a probability density function (pdf) $f_i(u)$ that satisfies
$\int_{R_i} f_i(u)\,du = 1$ for $u \in R_i$. If $u \notin R_i$, then $f_i(u) = 0$. We assume that the pdfs of uncertain objects follow uniform
distributions for ease of exposition. Our concept of PVD is applicable to other types of distributions. We
briefly discuss PVDs for other distributions in Section 4.3. For a uniform distribution, the pdf of $o_i$ can be expressed as
$f_i(u) = \frac{1}{Area(R_i)}$ for $u \in R_i$. For example, for a circular object $o_i$, the uncertainty region and the pdf are represented as
$R_i = (c_i, r_i)$ and $f_i(u) = \frac{1}{\pi r_i^2}$, respectively, where $c_i$ is the center and $r_i$ is the radius of the region. We also assume that
the uncertainty of objects remains constant.

An NN query on a traditional database consisting of a set of data points (or objects) returns the nearest data point 
to the query point. An NN query on an uncertain database does not return a single object, instead it returns a set of 
objects that have non-zero probabilities of being the NN to the query point. Suppose that the database maintains only 
point locations $c_1$, $c_2$, and $c_3$ for objects $o_1$, $o_2$, and $o_3$, respectively (see Figure 1). Then an NN query with respect
to $q$ returns $o_2$ as the NN because the distance $dist(c_2, q)$ is the least among all objects. In this case, $o_1$ and
$o_3$ are the second and third NNs, respectively, to the query point $q$. If the database maintains the uncertainty regions
$R_1 = (c_1, r_1)$, $R_2 = (c_2, r_2)$, and $R_3 = (c_3, r_3)$ for objects $o_1$, $o_2$, and $o_3$, respectively, then the NN query returns all
three $(o_1, p_1)$, $(o_2, p_2)$, $(o_3, p_3)$ as probable NNs for the query point $q$, where $p_1 > p_2 > p_3 > 0$ (see Figure 1).

A Probabilistic Nearest Neighbor (PNN) query [4] is defined as follows:

Definition 2.1. (PNN) Given a set $O$ of uncertain objects in a $d$-dimensional database, and a query point $q$, a PNN
query returns a set $P$ of tuples $(o_i, p_i)$, where $o_i \in O$ and $p_i$ is the non-zero probability that the distance of $o_i$ to $q$ is
the minimum among all objects in $O$.

The probability $p(o_i, q)$ (or simply $p_i$) of an object $o_i$ of being the NN to a query point $q$ can be computed as
follows. For any point $u \in R_i$, where $R_i$ is the uncertainty region of an object $o_i$, we first find the probability
of $o_i$ being at $u$ and multiply it by the probabilities of all other objects being farther than $u$ with respect to $q$, and then
sum these products over all $u$ to compute $p_i$. Thus, $p_i$ can be expressed as follows:

$$p_i = \int_{u \in R_i} f_i(u) \left( \prod_{j \neq i} \int_{v \in R_j} P(dist(v, q) > dist(u, q))\, dv \right) du, \qquad (1)$$

where the function $P(\cdot)$ returns the probability that a point $v \in R_j$ of $o_j$ is farther from $q$ than a point $u \in R_i$ of $o_i$.
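
As an illustration of what Equation 1 measures, the following minimal sketch estimates the NN probabilities by random sampling instead of integration. It is not the evaluation method used in this paper; it assumes circular objects with uniform pdfs, and the names `sample_in_circle` and `estimate_pnn` are hypothetical helpers introduced only for this example.

```python
import math
import random


def sample_in_circle(center, radius, rng):
    """Draw one point uniformly from the disc (center, radius)."""
    # The square root keeps the density uniform over the disc area.
    r = radius * math.sqrt(rng.random())
    theta = 2.0 * math.pi * rng.random()
    return (center[0] + r * math.cos(theta), center[1] + r * math.sin(theta))


def estimate_pnn(objects, q, trials=20000, seed=0):
    """Estimate p_i for every object, given objects as (center, radius) pairs.

    Each trial draws one possible location per object and records which
    object is nearest to q; the win frequencies approximate Equation 1.
    """
    rng = random.Random(seed)
    wins = [0] * len(objects)
    for _ in range(trials):
        dists = [math.dist(sample_in_circle(c, r, rng), q) for c, r in objects]
        wins[dists.index(min(dists))] += 1
    return [w / trials for w in wins]


if __name__ == "__main__":
    objs = [((-3.0, 0.0), 1.0), ((3.0, 0.0), 1.0)]
    print(estimate_pnn(objs, (0.0, 0.0)))  # roughly equal probabilities
```

The object with the largest estimate is the top-1-PNN candidate at $q$. The estimate is noisy near probabilistic bisectors, which is one reason an exact characterization of the bisectors is useful.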

Figure 1 shows a query point $q$, and three objects $o_1$, $o_2$, and $o_3$. Based on Equation 1, the probability $p_1$ of object
$o_1$ being the NN to $q$ can be computed as follows. In this example, we assume a discrete space where the radii of the three
objects are 5, 2, and 3 units, respectively, and the minimum distance of $o_1$ to $q$ is 5 units. Suppose that the dashed circles
$(q, 5)$, $(q, 6)$, $(q, 7)$, $(q, 8)$, and $(q, 9)$ centered at $q$ with radii 5, 6, 7, 8, and 9 units, respectively, divide the uncertain
region $R_1$ of $o_1$ into four sub-regions $o_{1_1}$, $o_{1_2}$, $o_{1_3}$, and $o_{1_4}$, where $o_{1_1} = (c_1, r_1) \cap (q, 6)$, $o_{1_2} = (c_1, r_1) \cap (q, 7) - o_{1_1}$,
$o_{1_3} = (c_1, r_1) \cap (q, 8) - (o_{1_1} \cup o_{1_2})$, and $o_{1_4} = (c_1, r_1) \cap (q, 9) - (o_{1_1} \cup o_{1_2} \cup o_{1_3})$; similarly, $R_2$ is divided into six sub-regions
$o_{2_1}$, $o_{2_2}$, $o_{2_3}$, $o_{2_4}$, $o_{2_5}$, and $o_{2_6}$; and $R_3$ is divided into three sub-regions $o_{3_1}$, $o_{3_2}$, and $o_{3_3}$.

Figure 1: An example of a PNN query

Then $p_1$ can be computed by summing: (i) the probability of $o_1$ being within the sub-region $o_{1_1}$ multiplied by the
probabilities of $o_2$ and $o_3$ being outside the circular region $(q, 6)$, (ii) the probability of $o_1$ being within the sub-region
$o_{1_2}$ multiplied by the probabilities of $o_2$ and $o_3$ being outside the circular region $(q, 7)$, (iii) the probability of $o_1$ being
within the sub-region $o_{1_3}$ multiplied by the probabilities of $o_2$ and $o_3$ being outside the circular region $(q, 8)$, and (iv)
the probability of $o_1$ being within the sub-region $o_{1_4}$ multiplied by the probabilities of $o_2$ and $o_3$ being outside the
circular region $(q, 9)$.

As we have discussed in the introduction, in many applications a user may often be interested in the most probable
nearest neighbor. In such cases, a PNN query returns only the object with the highest probability of being the NN; such a
query is also known as a top-1-PNN query. In this paper, we address the probabilistic moving NN query that continuously reports
the most probable NN for each query point of a moving query.

From Equation 1 we see that finding the most probable NN to a static query point is expensive, as it involves costly
integration and requires considering the uncertainty of other objects. Hence, for a moving user that needs to be updated
with the most probable answer continuously, the top object must be recomputed for every sampled
location of the moving query. In this paper, we propose PVD based approaches for evaluating a PMNN query.

In this paper, we propose two techniques to answer PMNN queries: a pre-computation approach and an incremental
approach. Depending on the nature of the application, one can choose whichever technique best suits her
purpose. Moreover, both of our techniques fit into either of the two most widely used query processing paradigms:
centralized paradigm, and client-server paradigm. In the centralized paradigm the query issuer and the processor 
reside in the same machine, and the total query processing cost is the main performance measurement metric. On the 
other hand, in the client-server paradigm, a client issues a query to a server that processes the query, through wireless 
links such as mobile phone networks. Thus, in the client-server paradigm the performance metric includes both the 
communication cost and the query processing cost. 

In the rest of the paper, we use the following functions: $\min(v_1, v_2, \ldots, v_n)$ and $\max(v_1, v_2, \ldots, v_n)$ return the
minimum and the maximum, respectively, of a given set of values $v_1, v_2, \ldots, v_n$; $dist(p_1, p_2)$ returns the Euclidean distance
between two points $p_1$ and $p_2$; $mindist(p, o)$ and $maxdist(p, o)$ return the minimum and maximum Euclidean distances,
respectively, between a point $p$ and an uncertain object $o$.
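
For circular uncertainty regions $R = (c, r)$, these distance functions reduce to simple expressions in $dist(p, c)$ and $r$. The sketch below is one possible rendering under that assumption; representing an object as a `(center, radius)` pair is a choice made here for illustration only.

```python
import math


def mindist(p, obj):
    """Minimum Euclidean distance from point p to a circular object (c, r)."""
    center, radius = obj
    # A point inside the region is at distance zero from it.
    return max(0.0, math.dist(p, center) - radius)


def maxdist(p, obj):
    """Maximum Euclidean distance from point p to a circular object (c, r)."""
    center, radius = obj
    return math.dist(p, center) + radius


if __name__ == "__main__":
    o = ((3.0, 4.0), 2.0)
    print(mindist((0.0, 0.0), o), maxdist((0.0, 0.0), o))
```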

We also use the following terminologies. When the possible range of values of two uncertain objects overlap then 
we call them overlapping objects; otherwise they are called non-overlapping objects. If the ranges of two objects are 
of equal length then we call them equi-range objects; otherwise they are called non-equi-range objects. 

3. Background 

In this section, we first give an overview of existing PNN query processing techniques on uncertain databases that 
are closely related to our work. Then we present existing work on Voronoi diagrams. 

3.1. Probabilistic Nearest Neighbor 

Processing PNN queries on uncertain databases has received significant attention in recent years. In [4],
Cheng et al. proposed a numerical integration based technique to evaluate a PNN query for one-dimensional sensor
data. In [5], an I/O efficient technique based on numerical integration was developed for evaluating PNN queries
on two-dimensional uncertain moving object data. In [7], the authors presented a sampling based technique to compute
PNN, where both data and query objects are uncertain. Probabilistic threshold NN queries have been introduced
in [13], where all objects with probabilities above a specified threshold are reported. In [14], a PNN algorithm was
presented where both data and query objects are static trajectories; the algorithm finds objects that have
non-zero probability of being the NN for any sub-interval of a given trajectory. Lian et al. [15] presented a technique for a group PNN
query that minimizes the aggregate distance to a set of static query points.

The PNN variant, the top-$k$-PNN query, reports the top $k$ objects that have higher probabilities of being the nearest than
other objects in the database [8][9][10]. Among these works, the techniques in [9][10] aim to reduce I/O and CPU costs
independently. In [8], the authors proposed a unified cost model that allows interleaving of I/O and CPU costs while
processing top-$k$-PNN queries. This method [8] uses lazy computational bounds for probability calculation, which is
found to be very efficient for finding the top-$k$-PNN.

Any existing method for static PNN queries [4][5][7] or their variants [8][9][10] can be used for evaluating PMNN
queries by processing the PMNN query as a sequence of PNN queries at sampled locations on the query path. Since in
this paper we are only interested in the most probable answer, we use the recent technique [8] to compute the top-1-PNN
for processing PMNN queries in a comparative sampling based approach and also for the probability calculation in
the PVD.

Some techniques [16][17] have been proposed for answering PNN queries (including top-$k$-PNN) for existentially
uncertain data, where objects are represented as points with associated membership probabilities. However, these 
techniques are not related to our work as they do not support uncertainty in objects' attributes. Our problem should 
also not be confused with maximum likelihood classifiers [18] where they use statistical decision rules to estimate the 
probability of an object being in a certain class, and assign the object to the class with the highest probability. 

All of the above mentioned schemes assume a static query point for PNN queries. Although continuous processing
of NN queries for a moving query point on a point data set has been a topic of interest for many years [19], we are the
first to address such queries on an uncertain data set. In this paper, we propose efficient techniques for probabilistic 
moving NN queries on an uncertain database, where we continuously report the most probable NN for a moving query 
point. 

3.2. Voronoi Diagrams 

The Voronoi diagram [11] is a popular approach for answering both static and continuous nearest neighbor queries
for two-dimensional point data [20]. Voronoi diagrams for extended objects (e.g., circular objects) [21] have been
proposed that use boundaries of objects, i.e., minimum distances to objects, to partition the space. However, these
objects are not uncertain, and thus, [21] cannot be used for PNN queries.

Voronoi diagrams for uncertain objects have been proposed that can divide the space for a set of sparsely
distributed objects [6][12]. Both of these approaches are based on the distance metric, where mindist and maxdist to
objects are used to calculate the boundary of the Voronoi edges.

The Voronoi diagram of [12] can be described as follows.

Let $R_1, R_2, \ldots, R_n$ be the regions of a set $O$ of uncertain objects $o_1, o_2, \ldots, o_n$, respectively. Then a set of sub-regions
or cells $V_1, V_2, \ldots, V_n$ in the data space can be determined such that a point in $V_i$ must be closer to any point in $R_i$ than
to any point in any other object's region. For two objects $o_i$ and $o_j$, let $H(i, j)$ be the set of points in the space that are
at least as close to any point in $R_i$ as to any point in $R_j$, i.e.,

$$H(i, j) = \{p \mid \forall x \in R_i\ \forall y \in R_j:\ dist(p, x) \le dist(p, y)\},$$

where $p$ is a point in the data space.

Then, the cell $V_i$ of object $o_i$ can be defined as follows:

$$V_i = \bigcap_{j \neq i} H(i, j).$$

The boundary $B(i, j)$ of $H(i, j)$ can be defined as the set of points $p \in H(i, j)$ such that $maxdist(p, o_i) =
mindist(p, o_j)$. If the regions are circular, the boundary of object $o_i$ with $o_j$ is the set of points $p$ that satisfy the following
condition:

$$dist(p, c_i) + r_i = dist(p, c_j) - r_j,$$

where $c_i$ and $c_j$ are the centers and $r_i$ and $r_j$ are the radii of the regions for objects $o_i$ and $o_j$, respectively.

Since $r_i$ and $r_j$ are constants, the points $p$ that satisfy the above equation lie on the arm of the hyperbola (with foci $c_i$ and $c_j$)
closest to $o_i$. Figure 2 shows an example of this Voronoi diagram for uncertain objects $o_1$ and $o_2$. The figure also
shows the neutral region (the region between two hyperbolic arms) for which the NN cannot be defined by using this
Voronoi diagram. Since this Voronoi diagram divides the space based on only the distances (i.e., mindist and maxdist
of objects), there may not exist any partition of the space when there is no point such that the mindist of an object is equal
to the maxdist of the other object, i.e., when the regions of objects overlap or are too close to each other.
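
The cell definition above can be turned into a direct membership test. The following sketch, assuming circular objects given as `(center, radius)` pairs, returns the index of the object whose cell contains a point, or `None` when the point lies in the neutral region. The function name `guaranteed_nn` is hypothetical and is used only for this illustration.

```python
import math


def guaranteed_nn(p, objects):
    """Return i if p lies in the cell V_i of the distance-based diagram, else None.

    p belongs to V_i when maxdist(p, o_i) <= mindist(p, o_j) for every j != i,
    which for circles is dist(p, c_i) + r_i <= dist(p, c_j) - r_j.
    """
    for i, (c_i, r_i) in enumerate(objects):
        max_i = math.dist(p, c_i) + r_i
        if all(
            max_i <= max(0.0, math.dist(p, c_j) - r_j)
            for j, (c_j, r_j) in enumerate(objects)
            if j != i
        ):
            return i
    return None  # neutral region: no object is guaranteed to be the NN


if __name__ == "__main__":
    sparse = [((0.0, 0.0), 1.0), ((10.0, 0.0), 1.0)]
    print(guaranteed_nn((0.0, 0.0), sparse))  # inside the cell of the first object
    print(guaranteed_nn((5.0, 0.0), sparse))  # between the two hyperbolic arms
```

With overlapping regions, such as two circles of radius 2 centered one unit apart, the test returns `None` for points inside the overlap, which reflects the limitation described above.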

In this approach, a Voronoi cell $V_i$ only contains those points in the data space that have $o_i$ as the nearest object
with probability one. Thus, this diagram is called a guaranteed Voronoi diagram for a given set of uncertain objects.
However, in our application domain, an uncertain database can contain objects with overlapping ranges or objects
in close proximity (i.e., densely populated) [4][5][7]. Hence a PNN query returns a set of objects (possibly more than
one) that have a chance of being the NN to the query point. With such a data distribution, the guaranteed
Voronoi diagram cannot divide the space at all, and as a result the neutral regions cover most of the data space, for
which no nearest object can be determined. However, for efficient PMNN query evaluation we need to continuously
find the most probable nearest object for each point of the query path. We propose a Probabilistic Voronoi Diagram
(PVD) that works for any distribution of data objects.

Cheng et al. [6] also propose a Voronoi diagram for uncertain data, called the Uncertain-Voronoi diagram (UV-diagram).
The UV-diagram partitions the space based on the distance metric, similar to the guaranteed Voronoi diagram [12].
For each uncertain object $o_i$, the UV-diagram defines a region (or UV-cell) where $o_i$ has a non-zero
probability of being the NN for any point in this region. The main difference between the two is that
the guaranteed Voronoi diagram finds the region for an object where the
object is guaranteed to be the NN for any point in this region, whereas the UV-diagram defines
a region for an object where the object has a chance of being the NN for any point in this region. For example, in
Figure 2, $o_1$ has a non-zero probability of being the NN for all points to the left of the hyperbolic arm closest to $o_2$,
and thus the region to the left of this hyperbolic arm defines the UV-cell for object $o_1$. Similarly, the
region to the right of the hyperbolic arm closest to $o_1$ defines the UV-cell for object $o_2$. Since both the UV-diagram and the
guaranteed Voronoi diagram are based on similar distance metrics, the UV-diagram suffers from the same
limitations as the guaranteed Voronoi diagram (as discussed above) and is not suitable for our purpose.
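
The UV-cell description can likewise be read as a membership test. The sketch below is not taken from [6]; it assumes circular objects and uses the common distance-based pruning rule that an object can be the NN of a point only if its minimum distance is smaller than the smallest maximum distance among all objects. The name `uv_candidates` is hypothetical.

```python
import math


def uv_candidates(p, objects):
    """Indices of circular objects (center, radius) with a non-zero chance of being the NN of p.

    Assumed rule: o_i qualifies when mindist(p, o_i) is strictly smaller than
    the smallest maxdist(p, o_j) over all objects o_j.
    """
    smallest_max = min(math.dist(p, c) + r for c, r in objects)
    return [
        i
        for i, (c, r) in enumerate(objects)
        if max(0.0, math.dist(p, c) - r) < smallest_max
    ]


if __name__ == "__main__":
    objs = [((0.0, 0.0), 1.0), ((10.0, 0.0), 1.0)]
    print(uv_candidates((0.0, 0.0), objs))  # only the first object qualifies
    print(uv_candidates((5.0, 0.0), objs))  # both objects qualify
```

A point may therefore belong to several UV-cells at once, which is why the UV-diagram alone does not identify the most probable NN.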



4. Probabilistic Voronoi Diagram 

A Probabilistic Voronoi Diagram (PVD) is defined as follows: 

Definition 4.1. (PVD) Let $O$ be a set of uncertain objects in a $d$-dimensional data space. The probabilistic Voronoi
diagram partitions the data space into a set of disjoint regions, called Probabilistic Voronoi Cells (PVCs). The PVC
of an object $o_i \in O$ is a region or a set of non-contiguous regions, denoted by $PVC(o_i)$, such that $p(o_i, q) > p(o_j, q)$ for
any point $q \in PVC(o_i)$ and for any object $o_j \in O - \{o_i\}$, where $p(o_i, q)$ and $p(o_j, q)$ are the probabilities of $o_i$ and $o_j$
of being the NNs to $q$.

The basic idea of computing a PVD is to identify the PVCs of all objects. To find a PVC of an object, we need to
find the boundaries of the PVC with all neighboring objects. The boundary line/curve that separates two neighboring
PVCs is called the probabilistic bisector of two corresponding objects, as both objects have equal probabilities of being
the NNs for any point on the boundary. Let $o_i$ and $o_j$ be two uncertain objects, and $pb_{o_i o_j}$ be the probabilistic bisector of
$o_i$ and $o_j$ that separates $PVC(o_i)$ and $PVC(o_j)$. Then, for any point $q \in pb_{o_i o_j}$, $p(o_i, q) = p(o_j, q)$; for any point
$q \in PVC(o_i)$, $p(o_i, q) > p(o_j, q)$; and for any point $q \in PVC(o_j)$, $p(o_i, q) < p(o_j, q)$.

A naive approach to compute the PVD requires the processing of PNN queries by using Equation 1 at every
possible location in the data space for determining the PVCs based on the calculated probabilities. This approach is 
prohibitively expensive in terms of computational cost and thus impractical. In this paper, we propose an efficient and 
practical solution for computing the PVD for uncertain objects. Next, we show how to efficiently compute PVDs, 
focusing on 1-dimensional (1D) and 2-dimensional (2D) spaces. We briefly discuss higher dimensional cases at the
end of this section. 



4.1. Probabilistic Voronoi Diagram in a 1D Space

Applications such as environmental monitoring and feature extraction systems capture one dimensional uncertain
attributes, and store these values in a database. In this section, we derive the PVD for 1D uncertain objects.

An uncertain 1D object $o_i$ can be represented as a range $[l_i, u_i]$, where $l_i$ and $u_i$ are the lower and upper bounds of
the range. Let $m_i$ and $n_i$ be the midpoint and the length of the range $[l_i, u_i]$, i.e., $m_i = \frac{l_i + u_i}{2}$ and $n_i = u_i - l_i$. The
probabilistic bisector $pb_{o_i o_j}$ of two 1D objects $o_i$ and $o_j$ is a point $x$ within the range $[\min(l_i, l_j), \max(u_i, u_j)]$ such that
$p(o_i, x) = p(o_j, x)$, and $p(o_i, x') > p(o_j, x')$ for any point $x' < x$ and $p(o_i, x'') < p(o_j, x'')$ for any point $x'' > x$. The
equality condition alone is not sufficient; the other two conditions must also hold. In our proofs of the lemmas, we will
show that a probabilistic bisector satisfies all three conditions.

For example, Figure 3(b) shows two uncertain objects $o_1$ and $o_2$, and their probabilistic bisector $pb_{o_1 o_2}$ as a point
$x$. In this example, the lengths of the ranges of $o_1$ and $o_2$ are $n_1 = 8$ and $n_2 = 4$, respectively, and the minimum distances
from $x$ to $o_1$ and $o_2$ are $d_1 = 1$ and $d_2 = 3$, respectively. Then based on Equation 1 we can compute the probabilities
of $o_1$ and $o_2$ of being the NN to $x$ as follows:

$$p(o_1, x) = \frac{2}{8} \cdot \frac{4}{4} + \frac{1}{8} \cdot \frac{3}{4} + \frac{1}{8} \cdot \frac{2}{4} + \frac{1}{8} \cdot \frac{1}{4} = \frac{14}{32},$$

and

$$p(o_2, x) = \frac{1}{4} \cdot \frac{5}{8} + \frac{1}{4} \cdot \frac{4}{8} + \frac{1}{4} \cdot \frac{3}{8} + \frac{1}{4} \cdot \frac{2}{8} = \frac{14}{32}.$$
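
The arithmetic of this example can be checked with a short script. The sketch below assumes the discrete reading used in the example: each range is split into unit cells, and an object located in a given cell is counted as the NN only when every other object lies entirely outside the circle through the far end of that cell. The coordinates ($x = 0$, $o_1 = [-9, -1]$, $o_2 = [3, 7]$) and the helper `nn_probability` are assumptions chosen to match $n_1 = 8$, $n_2 = 4$, $d_1 = 1$, and $d_2 = 3$.

```python
from fractions import Fraction


def unit_cells(lo, hi):
    """Split the integer range [lo, hi] into unit cells (a, a + 1)."""
    return [(a, a + 1) for a in range(lo, hi)]


def nn_probability(i, objects, x):
    """Discrete NN probability of objects[i] at integer position x.

    objects is a list of (l, u) integer ranges with uniform pdfs.
    """
    cells_i = unit_cells(*objects[i])
    total = Fraction(0)
    for a, b in cells_i:
        outer = max(abs(a - x), abs(b - x))  # far end of this cell from x
        term = Fraction(1, len(cells_i))     # probability of being in this cell
        for j, (lo, hi) in enumerate(objects):
            if j == i:
                continue
            cells_j = unit_cells(lo, hi)
            # Cells of o_j that lie entirely beyond distance `outer` from x.
            outside = sum(
                1 for c, d in cells_j if min(abs(c - x), abs(d - x)) >= outer
            )
            term *= Fraction(outside, len(cells_j))
        total += term
    return total


if __name__ == "__main__":
    objs = [(-9, -1), (3, 7)]
    print(nn_probability(0, objs, 0), nn_probability(1, objs, 0))
```

Both calls evaluate to 7/16, i.e., 14/32, in agreement with the two sums above. Moving the query one unit to the left ($x = -1$) makes the first probability larger than the second, as the bisector conditions require.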



A naive approach for finding $pb_{o_i o_j}$ requires the computation of probabilities (using Equation 1) of $o_i$ and $o_j$ for
every position within the range $[\min(l_i, l_j), \max(u_i, u_j)]$. To avoid the high computational overhead of this naive approach,
in our method we show that for two equi-range objects (i.e., $n_i = n_j$), we can always directly compute the probabilistic
bisector (see Lemma 4.1) by using the upper and lower bounds of two candidate objects. Similarly, we also show that
for two non-equi-range objects, where $n_i \neq n_j$, we can directly compute the probabilistic bisector for certain scenarios
shown in Lemmas 4.2 and 4.3, and for the remaining scenarios of non-equi-range objects we exploit these lemmas to find
probabilistic bisectors at reduced computational cost.



Next, we present the lemmas for 1D objects. Lemma 4.1 gives the probabilistic bisector of two equi-range objects,
overlapping and non-overlapping. Figure 3(a) is an example of a non-overlapping case. (Note that if $l_i = l_j$ and $u_i = u_j$,
then two objects $o_i$ and $o_j$ are assumed to be the same and no probabilistic bisector exists between them.)

Lemma 4.1. Let $o_i$ and $o_j$ be two objects where $m_i \neq m_j$. If $n_i = n_j$, then the probabilistic bisector $pb_{o_i o_j}$ of $o_i$ and $o_j$
is the bisector of $m_i$ and $m_j$.

Proof. Let $o_i$ and $o_j$ be two equi-range objects, i.e., $n_i = n_j$. Let $x$ be the bisector of two midpoints $m_i$ and $m_j$, i.e.,
$x = \frac{m_i + m_j}{2}$.

Then, by using Equation 1 we can calculate the probability of $o_i$ being the NN to $x$ as follows.

$$p(o_i, x) = \sum_{s=1}^{n_i - 1} \frac{1}{n_i} \cdot \frac{n_j - s}{n_j}.$$

Figure 3: Scenarios of lemmas



Similarly, we can calculate the probability of $o_j$ being the NN to $x$, as follows.

$$p(o_j, x) = \sum_{s=1}^{n_j - 1} \frac{1}{n_j} \cdot \frac{n_i - s}{n_i}.$$

If we put $n_i = n_j$ in the above two equations, we have $p(o_i, x) = p(o_j, x)$. Thus, the probabilities of $o_i$ and $o_j$ of
being the NN from the point $x$ are equal.

Now, let $x' = \frac{m_i + m_j}{2} - \epsilon$ be a point on the left side of $x$. Then we can calculate the probability of $o_i$ of being the NN
to $x'$ as follows.

$$p(o_i, x') = \frac{n_j}{n_i n_j} \, 2\epsilon + \sum_{s=1}^{n_i - 2\epsilon} \frac{1}{n_i} \cdot \frac{n_j - s}{n_j}.$$

Similarly, we can calculate the probability of $o_j$ being the NN to $x'$, as follows.

$$p(o_j, x') = \sum_{s=2\epsilon+1}^{n_i - 1} \frac{1}{n_j} \cdot \frac{n_i - s}{n_i}.$$

Now, if we put $n_i = n_j$ in the above two equations, then we have $p(o_i, x') > p(o_j, x')$ at $x'$. Similarly we can prove
that $p(o_i, x'') < p(o_j, x'')$ for a point $x''$ on the right side of $x$.

Thus, we can conclude that $x$ is the probabilistic bisector of $o_i$ and $o_j$, i.e., $pb_{o_i o_j} = x$.

The following lemma shows how to compute the probabilistic bisector of two non-equi-range objects that are
non-overlapping (see Figure 3(b)).

Lemma 4.2. Let $o_i$ and $o_j$ be two non-overlapping objects, where $n_i \neq n_j$. If there are no other objects within the
range $[\min(l_i, l_j), \max(u_i, u_j)]$, then the probabilistic bisector $pb_{o_i o_j}$ of $o_i$ and $o_j$ is the bisector of $m_i$ and $m_j$.

Proof. Let $n_i > n_j$, and $x$ be the bisector of two midpoints $m_i$ and $m_j$ of objects $o_i$ and $o_j$, respectively, i.e., $x = \frac{m_i + m_j}{2}$,
and let the minimum distances from $x$ to $o_i$ and $o_j$ be $d_i$ and $d_j$, respectively.

Then, by using Equation 1 we can calculate the probability of $o_i$ being the NN to $x$ as follows.

$$p(o_i, x) = \frac{1}{n_i} \cdot \frac{n_j}{n_j} (d_j - d_i) + \sum_{s=1}^{n_j - 1} \frac{1}{n_i} \cdot \frac{n_j - s}{n_j} = \frac{n_j}{n_i n_j} (d_j - d_i) + \frac{n_j (n_j - 1)}{2 n_i n_j}.$$

Similarly, we can calculate the probability of $o_j$ being the NN to $x$ as follows.

$$p(o_j, x) = \sum_{s=1}^{n_j} \frac{1}{n_j} \cdot \frac{n_i - (d_j - d_i + s)}{n_i}.$$

Since we have $d_j - d_i = \frac{n_i - n_j}{2}$, i.e., $n_i = 2(d_j - d_i) + n_j$, replacing $n_i$ in the numerator of $p(o_j, x)$ gives
the following:

$$p(o_j, x) = \sum_{s=1}^{n_j} \frac{1}{n_j} \cdot \frac{2(d_j - d_i) + n_j - (d_j - d_i + s)}{n_i} = \frac{n_j}{n_i n_j} (d_j - d_i) + \frac{n_j (n_j - 1)}{2 n_i n_j}.$$

Since $p(o_i, x) = p(o_j, x)$, we have $pb_{o_i o_j} = x$.

On the other hand, let $x' = \frac{m_i + m_j}{2} - \epsilon$ be a point on the left side of the probabilistic bisector. Then, by using Equation 1, we can calculate the probability of $o_i$ being the NN to $x'$ as follows.

\[ p(o_i, x') = \frac{(d_j - d_i + 2\epsilon)\, n_j}{n_i n_j} + \frac{n_j (n_j - 1)}{2 n_i n_j} \]

Similarly, we can calculate the probability of $o_j$ being the NN to $x'$ as follows.

\[ p(o_j, x') = \frac{(d_j - d_i - 2\epsilon)\, n_j}{n_i n_j} + \frac{n_j (n_j - 1)}{2 n_i n_j} \]

So, we can say $p(o_i, x') > p(o_j, x')$ for a point $x'$ on the left side of $pb_{o_i o_j}$. Similarly, we can prove that $p(o_i, x'') < p(o_j, x'')$ for a point $x''$ on the right side of $pb_{o_i o_j}$.
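The sums in the proof of Lemma 4.2 can be evaluated exactly with rational arithmetic. The following is a small sketch, assuming integer range lengths; the helper names `p_oi` and `p_oj` and the parameter `gap` (standing for $d_j - d_i$ at the evaluated point) are illustrative and not part of the paper.

```python
from fractions import Fraction as F

def p_oi(n_i, n_j, gap):
    """p(o_i, .) from the proof: gap/n_i plus the sum over the n_j shared distance steps."""
    return F(gap, n_i) + sum(F(1, n_i) * F(n_j - s, n_j) for s in range(1, n_j + 1))

def p_oj(n_i, n_j, gap):
    """p(o_j, .) from the proof: o_i must be farther than gap + s for each step s."""
    return sum(F(1, n_j) * F(n_i - (gap + s), n_i) for s in range(1, n_j + 1))

# Example: n_i = 6, n_j = 2, so at the bisector d_j - d_i = (n_i - n_j) / 2 = 2.
n_i, n_j = 6, 2
gap = (n_i - n_j) // 2

at_bisector = (p_oi(n_i, n_j, gap), p_oj(n_i, n_j, gap))
# Moving left by epsilon with 2 * epsilon = 1 widens the gap by 1.
left_of_bisector = (p_oi(n_i, n_j, gap + 1), p_oj(n_i, n_j, gap + 1))
print(at_bisector, left_of_bisector)
```

For this instance both probabilities equal 5/12 at the bisector, and they become 7/12 and 1/4 one step to the left, matching the closed forms with $d_j - d_i \pm 2\epsilon$.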

For two non-equi-range objects that are overlapping, the following lemma directly computes the probabilistic bisector for the scenarios where the lower, upper, or mid-point values of the two candidate objects are the same (see Figures 3(c), (d), and (e)).

Lemma 4.3. Let $o_i$ and $o_j$ be two overlapping objects, where $n_i \neq n_j$, $l_i \le l_j < u_j \le u_i$, and there are no other objects within the range $[\min(l_i, l_j), \max(u_i, u_j)]$.

1. If $l_i = l_j$, then the probabilistic bisector $pb_{o_i o_j}$ of $o_i$ and $o_j$ is the bisector of $m_i$ and $u_j$.

2. If $u_i = u_j$, then the probabilistic bisector $pb_{o_i o_j}$ of $o_i$ and $o_j$ is the bisector of $m_i$ and $l_j$.

3. If $m_i = m_j$, then the probabilistic bisectors $pb_{o_i o_j}$ of $o_i$ and $o_j$ are the bisectors of $l_i$ and $l_j$, and of $u_i$ and $u_j$.

Proof. Let $n_i > n_j$, $l_i = l_j$, $x = \frac{m_i + u_j}{2}$, and $d$ be the distance from $x$ to both $m_i$ and $u_j$.

Then, by using Equation 1, we can calculate the probability of $o_i$ being the NN to $x$ as follows.

\[ p(o_i, x) = \sum_{s=1}^{2d} \frac{1}{n_i} + \sum_{s=1}^{n_j} \frac{2}{n_i} \cdot \frac{n_j - s}{n_j} = \frac{2d}{n_i} + \frac{n_j (n_j - 1)}{n_i n_j} \]

Similarly, we can calculate the probability of $o_j$ being the NN to $x$ as follows.

\[ p(o_j, x) = \sum_{s=1}^{n_j} \frac{1}{n_j} \cdot \frac{n_i - (2d + 2s)}{n_i} \]

However, $\frac{n_i}{2} - n_j = 2d$, that is, $n_i = 4d + 2n_j$. By replacing $n_i$ in the numerator and simplifying the term, we have $p(o_j, x) = \frac{2d}{n_i} + \frac{n_j (n_j - 1)}{n_i n_j}$. Since $p(o_i, x) = p(o_j, x)$, $pb_{o_i o_j} = x$. Similar to Lemma 4.2, we can prove that $p(o_i, x') > p(o_j, x')$ for any point $x'$ on the left, and $p(o_i, x'') < p(o_j, x'')$ for any point $x''$ on the right side of $pb_{o_i o_j}$. Similarly, we can prove the case for $u_i = u_j$.

Let $m_i = m_j$, $x_1 = \frac{l_i + l_j}{2}$, and $d$ be the distance from $x_1$ to both $l_i$ and $l_j$.

Then, by using Equation 1, we can calculate the probability of $o_i$ being the NN to $x_1$ as follows.

\[ p(o_i, x_1) = \sum_{s=1}^{2d} \frac{1}{n_i} + \sum_{s=1}^{n_j} \frac{1}{n_i} \cdot \frac{n_j - s}{n_j} = \frac{2d}{n_i} + \frac{n_j (n_j - 1)}{2 n_i n_j} \]

Similarly, we can calculate the probability of $o_j$ being the NN to $x_1$ as follows.

\[ p(o_j, x_1) = \sum_{s=1}^{n_j} \frac{1}{n_j} \cdot \frac{n_i - (2d + s)}{n_i} \]

However, $\frac{n_i}{2} - \frac{n_j}{2} = 2d$, that is, $n_i = 4d + n_j$. By replacing $n_i$ in the numerator and simplifying the term, we have $p(o_j, x_1) = \frac{2d}{n_i} + \frac{n_j (n_j - 1)}{2 n_i n_j}$. Since $p(o_i, x_1) = p(o_j, x_1)$, we have $pb_{o_i o_j} = x_1$. Similar to Lemma 4.2, we can prove that $p(o_i, x') > p(o_j, x')$ for any point $x'$ on the left, and $p(o_i, x'') < p(o_j, x'')$ for any point $x''$ on the right side of $pb_{o_i o_j}$.

Similarly, we can prove that the other probabilistic bisector exists at $x_2 = \frac{u_i + u_j}{2}$, as the case is symmetric to that of $x_1$.

Note that, since $n_i > n_j$ and $m_i = m_j$, $o_i$ completely contains $o_j$. Thus the probability of $o_j$ is higher than that of $o_i$ around the mid-point ($m_i$), and the probability of $o_i$ is higher than that of $o_j$ towards the boundary points ($l_i$ and $u_i$). Therefore, in this case, we have two probabilistic bisectors between $o_i$ and $o_j$.
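As a check on the two derivations above, the sketch below evaluates the sums for the first case ($l_i = l_j$) and the third case ($m_i = m_j$) with exact fractions. It assumes integer $n_j$ and $d$ and derives $n_i$ from the relations used in the proof ($n_i = 4d + 2n_j$ and $n_i = 4d + n_j$); the function names are illustrative only.

```python
from fractions import Fraction as F

def lemma43_case1(n_j, d):
    """Case l_i = l_j: x bisects m_i and u_j, so n_i = 4d + 2*n_j."""
    n_i = 4 * d + 2 * n_j
    p_i = F(2 * d, n_i) + sum(F(2, n_i) * F(n_j - s, n_j) for s in range(1, n_j + 1))
    p_j = sum(F(1, n_j) * F(n_i - (2 * d + 2 * s), n_i) for s in range(1, n_j + 1))
    return p_i, p_j

def lemma43_case3(n_j, d):
    """Case m_i = m_j: x_1 bisects l_i and l_j, so n_i = 4d + n_j."""
    n_i = 4 * d + n_j
    p_i = F(2 * d, n_i) + sum(F(1, n_i) * F(n_j - s, n_j) for s in range(1, n_j + 1))
    p_j = sum(F(1, n_j) * F(n_i - (2 * d + s), n_i) for s in range(1, n_j + 1))
    return p_i, p_j

case1 = lemma43_case1(n_j=2, d=2)   # n_i = 12
case3 = lemma43_case3(n_j=4, d=2)   # n_i = 12
print(case1, case3)
```

In both instances the two probabilities coincide (5/12 for the first case and 11/24 for the third), as the lemma states for the chosen bisector positions.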



Figures 3(c)-(e) show an example of the three cases described in Lemma 4.3. Figure 3(c) shows the first case for objects $o_1$ and $o_2$, where $l_1 = l_2$ and $pb_{o_1 o_2} = \frac{m_1 + u_2}{2}$. Similarly, Figure 3(d) shows an example of the second case for objects $o_1$ and $o_2$, where $u_1 = u_2$ and $pb_{o_1 o_2} = \frac{m_1 + l_2}{2}$. Finally, Figure 3(e) shows an example of the third case for objects $o_1$ and $o_2$, where $m_1 = m_2$, and $x_1 = \frac{l_1 + l_2}{2}$ and $x_2 = \frac{u_1 + u_2}{2}$ are two probabilistic bisectors. In such a case, the two probabilistic bisectors, $x_1$ and $x_2$, divide the space into three subspaces. That means the Voronoi cell of object $o_1$ comprises two disjoint subspaces. In Figure 3(e), the subspace to the left of $x_1$ and the subspace to the right of $x_2$ form the Voronoi cell of $o_1$, and the subspace bounded by $x_1$ and $x_2$ forms the Voronoi cell of $o_2$.

Apart from the above-mentioned scenarios, the remaining scenarios of two overlapping non-equi-range objects are shown in Figure 4, where it is not possible to compute the probabilistic bisector directly by using the lower and upper bounds of the two candidate objects. In these scenarios, Lemma 4.3 can be used for choosing a point, called the initial probabilistic bisector, which approximates the actual probabilistic bisector and thereby reduces the computational overhead. Figures 4(a), (b), and (c) show three scenarios where cases (1), (2), and (3) of Lemma 4.3, respectively, are used to compute the initial probabilistic bisector for our algorithm. We will see in Algorithm 1 how to use our lemmas to find the probabilistic bisectors for these scenarios.



Figure 4: Remaining scenarios



So far we have assumed that no other objects exist within the ranges of the two candidate objects. However, the probabilities of the two candidate objects may change in the presence of any other objects within their ranges (as shown in Equation 1). Only the probabilistic bisector of two equi-range objects remains the same in the presence of any other object within their ranges.

Let $o_k$ be a third object that overlaps with the range $[\min(l_i, l_j), \max(u_i, u_j)]$ for the case in Figure 3(a). Then, using Equation 1, we can calculate the NN probability of object $o_i$ from $x$ as follows.

\[ p(o_i, x) = \sum_{s} \frac{1}{n_i} \cdot \frac{n_j - s}{n_j} \cdot \frac{n_k - s}{n_k} \]

Similarly, we can calculate the NN probability of object $o_j$ from $x$ as follows.

\[ p(o_j, x) = \sum_{s} \frac{1}{n_j} \cdot \frac{n_i - s}{n_i} \cdot \frac{n_k - s}{n_k} \]

Since $n_i = n_j$, we have $p(o_i, x) = p(o_j, x)$ and $pb_{o_i o_j} = x$. Therefore, the probabilistic bisector $pb_{o_i o_j}$ does not change in the presence of a third object.

Therefore, except for the case when the two candidate objects are equi-range, whenever any other object exists within the ranges of the two candidate objects, we again use one of Lemmas 4.1-4.3 to compute the initial probabilistic bisector, and then find the actual probabilistic bisector. For example, if two non-equi-range candidate objects do not overlap each other (see Figure 3(b)) and a third object (not shown in the figure) exists within the range of these two candidate objects, then we use Lemma 4.2 to find the initial probabilistic bisector. Similarly, we choose the corresponding lemmas for other scenarios to compute initial probabilistic bisectors. Then we use these computed initial probabilistic bisectors to find the actual probabilistic bisectors.

The position of a probabilistic bisector depends on the relative positions and the uncertainty regions of two can- 
didate objects. We have shown that for some scenarios the probabilistic bisectors can be directly computed using the 
proposed lemmas. In some other scenarios, there is no straightforward way to compute probabilistic bisectors. For this 
latter case, the initial probabilistic bisector of two candidate objects is chosen based on the actual probabilistic bisector 
of the scenario that can be directly computed and has the most similarity (relative positions of candidate objects) with 
two candidate objects. This ensures that the initial probabilistic bisector is essentially close to the actual probabilistic 
bisector. 
Algorithms:

Algorithm 1: ProbBisector1D(o_i, o_j, O)

1.1   pb_{o_i o_j} ← ∅
1.2   if o_i and o_j satisfy one of the Lemmas 4.1-4.3 then
1.3       pb_{o_i o_j} ← BisectorBasedOnLemmas(o_i, o_j, O)
1.4   else
1.5       // any pair of objects that does not satisfy Lemmas 4.1-4.3
1.6       if o_i and o_j do not overlap then
1.7           ipb ← Bisector(m_i, m_j)
1.8           pb_{o_i o_j} ← FindProbBisector1D(o_i, o_j, ipb, O)
1.9       else
1.10          // three possible cases for overlapping pairs of objects (Figure 4)
1.11          // assume l_i < l_j (the other case l_i > l_j is symmetric)
1.12          if l_i < l_j and u_j < u_i then
1.13              ipb_1 ← Bisector(l_i, l_j)
1.14              ipb_2 ← Bisector(u_i, u_j)
1.15              pb_{o_i o_j} ← FindProbBisector1D(o_i, o_j, ipb_1, O) ∪ FindProbBisector1D(o_i, o_j, ipb_2, O)
1.16          else if l_j - l_i < u_j - u_i then
1.17              ipb ← Bisector(m_j, u_i)
1.18              pb_{o_i o_j} ← FindProbBisector1D(o_i, o_j, ipb, O)
1.19          else
1.20              ipb ← Bisector(m_i, l_j)
1.21              pb_{o_i o_j} ← FindProbBisector1D(o_i, o_j, ipb, O)
1.22  return pb_{o_i o_j}



Based on the above lemmas, Algorithm 1 summarizes the steps of computing the probabilistic bisector $pb_{o_i o_j}$ for any two objects $o_i$ and $o_j$, where $O$ is a given set of objects and $o_i, o_j \in O$. If $o_i$ and $o_j$ satisfy any of Lemmas 4.1-4.3, the algorithm directly computes $pb_{o_i o_j}$ (Lines 1.2-1.3). Otherwise, if any other object exists within the range of two candidate non-equi-range objects $o_i$ and $o_j$, or the two candidate non-equi-range objects fall in any of the scenarios shown in Figure 4, the algorithm first computes an initial probabilistic bisector $ipb$ using the lemma whose scenario is most similar to the given one in terms of the relative positions of the candidate objects. Then, the algorithm calls the function FindProbBisector1D to find $pb_{o_i o_j}$ by using $ipb$ as a base (Lines 1.8, 1.15, 1.18, and 1.21).

The function FindProbBisector1D computes $pb_{o_i o_j}$ by refining $ipb$. If the probabilities of $o_i$ and $o_j$ being the NN from $ipb$ are equal, then the algorithm returns $ipb$ as the probabilistic bisector. Otherwise, the algorithm decides in which direction from $ipb$ it should continue the search for $pb_{o_i o_j}$. Let $x = ipb$. We also assume that $o_i$ is to the left of $o_j$. If $p(o_i, x)$ is smaller than $p(o_j, x)$, then $pb_{o_i o_j}$ is to the left of $x$ and within the range $[\min(l_i, l_j), x]$; otherwise $pb_{o_i o_j}$ is to the right of $x$ and within the range $[x, \max(u_i, u_j)]$. Since we use the lemmas to choose $ipb$ as close as possible to $pb_{o_i o_j}$, in most cases the probabilistic bisector is found very close to the position of $ipb$. Thus, as an alternative to directly running a binary search within the range, one can perform a step-wise search first, by increasing (or decreasing) the value of $x$ until the probability ranking of the two objects swaps. Since the precision of probability measures affects the performance of the above search, we assume that the two probability measures are equal when the difference between them is smaller than a threshold. The value of the threshold can be found experimentally for a given application domain.
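The refinement described above can be sketched as follows. This is an illustrative sketch rather than the paper's implementation: it assumes continuous uniform ranges, evaluates NN probabilities by a midpoint-rule integration (`nn_prob`, a hypothetical helper standing in for Equation 1), and combines the step-wise search with a bisection once the ranking swaps. The parameters `step`, `eps` (the equality threshold), and `max_steps` are assumptions.

```python
def overlap(l, u, a, b):
    """Length of the intersection of [l, u] and [a, b]."""
    return max(0.0, min(u, b) - max(l, a))

def nn_prob(x, target, others, samples=2000):
    """P(target is the NN of x) for uniform 1D ranges, by midpoint integration."""
    l, u = target
    h = (u - l) / samples
    total = 0.0
    for k in range(samples):
        r = abs(l + (k + 0.5) * h - x)  # distance of this possible location
        p = 1.0
        for lo, hi in others:
            # Probability that the other object lies farther than r from x.
            p *= 1.0 - overlap(lo, hi, x - r, x + r) / (hi - lo)
        total += p
    return total / samples

def find_prob_bisector_1d(o_i, o_j, ipb, others=(), step=0.25, eps=1e-4, max_steps=400):
    """Refine ipb; o_i is assumed to lie to the left of o_j."""
    def diff(x):
        return nn_prob(x, o_i, [o_j, *others]) - nn_prob(x, o_j, [o_i, *others])

    x, d = ipb, diff(ipb)
    if abs(d) < eps:
        return x
    direction = 1.0 if d > 0 else -1.0  # p(o_i) smaller: search to the left
    for _ in range(max_steps):
        nxt = x + direction * step
        d_nxt = diff(nxt)
        if abs(d_nxt) < eps:
            return nxt
        if (d_nxt > 0) != (d > 0):  # ranking swapped between x and nxt
            break
        x = nxt
    else:
        raise ValueError("ranking never swapped within max_steps")
    a, b = min(x, nxt), max(x, nxt)  # diff(a) > 0 > diff(b)
    while b - a > 1e-9:
        mid = (a + b) / 2
        d_mid = diff(mid)
        if abs(d_mid) < eps:
            return mid
        if d_mid > 0:
            a = mid
        else:
            b = mid
    return (a + b) / 2

o_i, o_j = (0.0, 6.0), (10.0, 12.0)   # midpoints 3 and 11
pb_from_left = find_prob_bisector_1d(o_i, o_j, ipb=5.0)
pb_from_right = find_prob_bisector_1d(o_i, o_j, ipb=9.0)
print(pb_from_left, pb_from_right)
```

Starting from either side of the true bisector, the search converges close to the bisector of the two midpoints for this non-overlapping pair, in line with Lemma 4.2. Additional objects can be passed through `others` to model the influence of a third object.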

Finally, Algorithm 2 shows the steps for computing a PVD for a set of 1D uncertain objects $O$. In 1D data space, the PVD contains a list of bisectors that divides the total data space into a set of Voronoi cells or 1D ranges. The basic idea of Algorithm 2 is that, once we have the probabilistic bisectors of all pairs of objects in a sorted list, a sequential scan of the list can find the candidate probabilistic bisectors that comprise the probabilistic Voronoi diagram in 1D space.

To avoid computing probabilistic bisectors for all pairs of objects $o_i, o_j \in O$, we use the following heuristic:

Heuristic 4.1. Let $o_i$ be an object in the ordered (in ascending order of $l_i$) list of objects $O$, and $o_j$ be the next object to the right of $o_i$ in $O$. Let $x = pb_{o_i o_j}$, and $d = dist(x, l_j)$. Let $o_k$ be an object in $O$. If $dist(x, l_k) > d$, then the probabilistic bisector $pb_{o_i o_k}$ of $o_i$ and $o_k$ is $x'$, and $x'$ is to the right of $x$, i.e., $x' > x$; therefore $pb_{o_i o_k}$ does not need to be computed.

Algorithm 2: ProbVoronoi1D(O)

2.1   PVD ← ∅
2.2   PBL ← ∅
2.3   SortObjects(O)
2.4   for each o_i ∈ O do
2.5       o_j ← getNext(O); pb_{o_i o_j} ← ProbBisector1D(o_i, o_j, O); PBL ← PBL ∪ pb_{o_i o_j}
2.6       N ← getCandidateObjects(O, pb_{o_i o_j})
2.7       for each o_k ∈ N do
2.8           pb_{o_i o_k} ← ProbBisector1D(o_i, o_k, O); PBL ← PBL ∪ pb_{o_i o_k}
2.9   SPBL ← SortProbBisectors(PBL)
2.10  o' ← initialMostProbableObject()
2.11  while SPBL is not empty do
2.12      pb_{o_i o_j} ← popNextPB(SPBL)
2.13      left ← LeftSideObject()
2.14      right ← RightSideObject()
2.15      if PVD is empty OR o' = left then
2.16          PVD ← PVD ∪ ProbBisector(pb_{o_i o_j}, o_i, o_j)
2.17          o' ← right
2.18      else
2.19          Discard(pb_{o_i o_j})
2.20  return PVD

Algorithm 2 runs as follows. First, the algorithm sorts all objects in ascending order of their lower bounds (Line 2.3). Second, for each object $o_i$, it computes the probabilistic bisectors of $o_i$ with the next object $o_j \in O$ and with a set of objects returned by the function getCandidateObjects based on Heuristic 4.1 (Lines 2.4-2.8). PBL maintains the list of all computed probabilistic bisectors. Third, the algorithm sorts the list PBL in ascending order of the position of probabilistic bisectors and assigns the sorted list to SPBL (Line 2.9). Finally, from SPBL, the algorithm selects the probabilistic bisectors that contribute to the PVD (Lines 2.10-2.19). For this final step, the algorithm first finds the most probable NN $o'$ with respect to the starting position of the data space. Then for each $pb_{o_i o_j} \in SPBL$, the algorithm decides whether $pb_{o_i o_j}$ is a candidate for the PVD (Lines 2.11-2.19). We assume that $o_i$ is the left side object and $o_j$ is the right side object of the probabilistic bisector. If $o' = o_i$, then $pb_{o_i o_j}$ is included in the PVD, and $o'$ is updated with the most probable object on the right region of $pb_{o_i o_j}$ (Line 2.17). Otherwise, $pb_{o_i o_j}$ is discarded (Line 2.19). This process continues until SPBL becomes empty, and the algorithm finally returns the PVD.
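The selection step (Lines 2.10-2.19) can be sketched as a single pass over the sorted bisector list. The tuple layout `(position, left_object, right_object)` and the function name `scan_bisectors` are assumptions made for this illustration; the bisector positions in the example are made up.

```python
def scan_bisectors(sorted_pbs, initial_object):
    """Keep the bisectors whose left object is the current most probable NN.

    sorted_pbs: list of (position, left_object, right_object), sorted by position.
    initial_object: most probable NN at the start of the data space.
    """
    pvd = []
    current = initial_object
    for position, left, right in sorted_pbs:
        # Mirrors the test "PVD is empty OR o' = left" of the pseudocode.
        if not pvd or current == left:
            pvd.append((position, left, right))
            current = right  # to the right of this bisector, `right` takes over
        # Otherwise the bisector is discarded.
    return pvd

pbs = [(2.0, "a", "b"), (3.0, "a", "c"), (5.0, "b", "c")]
pvd = scan_bisectors(pbs, initial_object="a")
print(pvd)
```

Here the bisector of `a` and `c` at position 3.0 is discarded because `b` is already the most probable NN there, so the resulting diagram has the cells of `a`, `b`, and `c` separated at 2.0 and 5.0.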

The proof of correctness and the complexity of this algorithm are provided as follows. 

Correctness: Let SPBL be the list of probabilistic bisectors in ascending order of their positions. Let $o'$ be the most probable NN with respect to the starting point $l$ of the 1D data space. Let $pb_{o_i o_j}$ be the next probabilistic bisector fetched from SPBL. Now we can have the following two cases: (i) Case 1: $o' = o_i$. The probability $p_i$ of $o_i$ being the nearest is the highest for all points from $l$ to $pb_{o_i o_j}$, and the probability $p_j$ of $o_j$ being the nearest is the highest for points on the right side of $pb_{o_i o_j}$ until the next valid probabilistic bisector is found. Hence, $pb_{o_i o_j}$ is a valid probabilistic bisector and is added to the PVD. Then the algorithm updates $o'$ to $o_j$, since $o_j$ will be the most probable on the right of $pb_{o_i o_j}$ and will be on the left region of the next valid probabilistic bisector. (ii) Case 2: $o' \neq o_i$. Let us assume that $p_i > p'$ at $pb_{o_i o_j}$. We already know that $p' > p_i$ at the starting point $l$. So there should be some point within the range $[l, pb_{o_i o_j}]$ where $p' = p_i$, which is the position of the probabilistic bisector of $o'$ and $o_i$. Since no such bisector is found within this range, $p_i > p'$ is not true at $pb_{o_i o_j}$. Thus, $p'$ is the highest even at $pb_{o_i o_j}$, and will remain the highest until the algorithm fetches another $pb_{o_{i'} o_{j'}}$ from SPBL, where $o' = o_{i'}$. The above process continues until the algorithm reaches the end of the data space.

Complexity: The complexity of Algorithm 2 can be determined as follows. Let $C_b$ be the cost of computing the probability of an object being the NN of a query point, and $C_{pb}$ be the cost of finding the probabilistic bisector of two objects. The complexity of Algorithm 2 is dominated by the complexity of executing Lines 2.4-2.8, which is $O(n N C_{pb})$, where $n$ is the total number of objects, and $N$ is the expected number of probabilistic bisectors that need to be computed for each object in $O$. For real data sets, $N$ is found to be a small value since each object has a small number of surrounding objects (in the worst case it can be $n - 1$). The cost is $C_{pb} = O(C_b \log_2 D)$, where $D$ is the expected distance between our initial probabilistic bisector $ipb$ and the actual probabilistic bisector. This is because the cost of finding a probabilistic bisector is $O(1)$ for the cases when our algorithm can directly compute the probabilistic bisector, and for other cases our algorithm first finds $ipb$ in $O(1)$ and then searches for the actual probabilistic bisector using FindProbBisector1D in $O(\log D)$.

4.2. Probabilistic Voronoi Diagram in a 2D Space 

In location-based applications, locations of objects such as a passenger and a building in a 2D space can be uncertain due to the imprecision of data capturing devices or the privacy concerns of users. In these applications, the location of an object $o_i$ can be represented as a circular region $R_i = (c_i, r_i)$, where $c_i$ is the center and $r_i$ is the radius of the region, and the actual location of $o_i$ can be anywhere in $R_i$. The area of $o_i$ is expressed as $A_i = \pi r_i^2$. In this section, we derive the PVD for 2D uncertain objects.

Similar to the 1D case, a naive approach to find the probabilistic bisector $pb_{o_i o_j}$ of $o_i$ and $o_j$ requires an exhaustive computation of probabilities using Equation 1 for every position in a large area. In our approach, we first show that we can directly compute $pb_{o_i o_j}$ as the bisector $bs_{c_i c_j}$ of $c_i$ and $c_j$ when the two candidate objects are equi-range (i.e., $r_i = r_j$). Next, we show that for two non-equi-range objects (i.e., $r_i \neq r_j$), depending on the radii and relative positions of the objects, $pb_{o_i o_j}$ slightly shifts from $bs_{c_i c_j}$. In this case, we use $bs_{c_i c_j}$ to choose a line, called the initial probabilistic bisector, to approximate the actual probabilistic bisector $pb_{o_i o_j}$. Although for simplicity of presentation we will use examples where the two candidate objects are non-overlapping, Lemmas 4.4-4.7 also hold for overlapping objects.



For two equi-range uncertain circular objects $o_i$ and $o_j$, we have the following lemma:

Lemma 4.4. Let $o_i$ and $o_j$ be two circular uncertain objects with uncertain regions $(c_i, r_i)$ and $(c_j, r_j)$, respectively. If $r_i = r_j$, then the probabilistic bisector $pb_{o_i o_j}$ of $o_i$ and $o_j$ is the bisector $bs_{c_i c_j}$ of $c_i$ and $c_j$.

Proof. Let $x$ be any point on $bs_{c_i c_j}$, and $d = mindist(x, o_i)$ (or $mindist(x, o_j)$). Let there be no other objects within the circular range centered at $x$ with radius $d + 2r_i$. Suppose circles centered at $x$ with radii $d + 1$ to $d + 2r_i$ partition $o_i$ into $2r_i$ sub-regions $o_{i_1}, o_{i_2}, \ldots, o_{i_{2r_i}}$, such that $\sum_{s=1}^{2r_i} \frac{o_{i_s}}{A_i} = 1$. Similarly, $o_j$ is divided into $2r_i$ sub-regions $o_{j_1}, o_{j_2}, \ldots, o_{j_{2r_i}}$, where $\sum_{s=1}^{2r_i} \frac{o_{j_s}}{A_j} = 1$. By using Equation 1, we can calculate the probability of $o_i$ being the nearest from $x$ as follows.

\[ p(o_i, x) = \sum_{s=d+1}^{2r_i+d} \frac{o_{i_s}}{A_i} \left( 1 - \sum_{u=d+1}^{s} \frac{o_{j_u}}{A_j} \right) \]

Similarly, we can calculate the probability of $o_j$ being the nearest from $x$ as follows.

\[ p(o_j, x) = \sum_{s=d+1}^{2r_i+d} \frac{o_{j_s}}{A_j} \left( 1 - \sum_{u=d+1}^{s} \frac{o_{i_u}}{A_i} \right) \]

Since $r_i = r_j$ and $o_{i_s} = o_{j_s}$ for all $1 \le s \le 2r_i$, we have $p(o_i, x) = p(o_j, x)$.

The probabilistic bisector $pb_{o_1 o_2}$ of two equi-range objects $o_1$ and $o_2$ is shown in Figure 5.



Lemmas 4.5 and 4.6 show how the probabilistic bisector of two non-equi-range objects $o_i$ and $o_j$ is related to the bisector of $c_i$ and $c_j$ (Figures 6 and 7).

Next, we will show in Lemma 4.5 that the shape of $pb_{o_i o_j}$ for two non-equi-range circular objects $o_i$ and $o_j$ is a curve, and the distance of this curve from $bs_{c_i c_j}$ is maximum on the line $\overline{c_i c_j}$. Figure 6 shows the bisector $bs_{c_1 c_2}$ and the probabilistic bisector $pb_{o_1 o_2}$ for $o_1$ and $o_2$.

Figure 5: The probabilistic bisector of objects $o_1$ and $o_2$, where $r_1 = r_2$

Figure 6: The probabilistic bisector of objects $o_1$ and $o_2$, where $r_1 > r_2$. The curve $pb_{o_1 o_2}$ is the probabilistic bisector between $o_1$ and $o_2$, i.e., $p(o_1, x) = p(o_2, x)$ for any point $x \in pb_{o_1 o_2}$



Lemma 4.5. Let $o_i$ and $o_j$ be two objects with non-equi-range uncertain circular regions $(c_i, r_i)$ and $(c_j, r_j)$, respectively, and $bs_{c_i c_j}$ be the bisector of $c_i$ and $c_j$. Then the maximum distance between $bs_{c_i c_j}$ and $pb_{o_i o_j}$ occurs on the line $\overline{c_i c_j}$. This distance gradually decreases as we move towards positive or negative infinity along the bisector $bs_{c_i c_j}$.

Proof. Let $x = \frac{c_i + c_j}{2}$ be the intersection point of $bs_{c_i c_j}$ and $\overline{c_i c_j}$. Suppose a circle centered at $x$ with radius $\frac{dist(c_i, c_j)}{2}$ divides $o_i$ into $o_{i_1}$ and $o_{i_2}$, where $\frac{o_{i_1}}{A_i} + \frac{o_{i_2}}{A_i} = 1$, and $o_j$ into $o_{j_1}$ and $o_{j_2}$, where $\frac{o_{j_1}}{A_j} + \frac{o_{j_2}}{A_j} = 1$. According to the curvature properties of circles, since $r_i > r_j$, we have $\frac{o_{i_1}}{A_i} < \frac{o_{j_1}}{A_j}$ (in Figure 6, $\frac{o_{1_1}}{A_1} < \frac{o_{2_1}}{A_2}$), which intuitively means that $o_j$ is a more probable NN than $o_i$ to $x$, i.e., $p(o_j, x) > p(o_i, x)$. Thus, $x$ needs to be shifted to a point towards $c_i$ (along the line $\overline{x c_i}$), such that the probabilities of $o_i$ and $o_j$ being the NNs to the new point become equal.

Suppose a point $x'$ is on $bs_{c_i c_j}$ at positive infinity. If a circle centered at $x'$ goes through the centers of both objects $o_i$ and $o_j$, then the portion of the circle that falls inside an object ($o_i$ or $o_j$) becomes a straight line. This is because, in this case, we consider a small portion of the curve of an infinitely large circle. This circle divides both objects $o_i$ and $o_j$ into two equal parts, $o_{i_1} = o_{i_2}$ and $o_{j_1} = o_{j_2}$, respectively. Thus, the probabilities of $o_i$ and $o_j$ being the NNs approach equality at positive infinity, i.e., $p(o_j, x') \approx p(o_i, x')$ for large values of $dist(x', x)$. Similarly, we can show the case for a point $x''$ at negative infinity on $bs_{c_i c_j}$ (see Figure 6).



Next, we show in Lemma 4.6 that $pb_{o_i o_j}$ shifts from $bs_{c_i c_j}$ towards the object with the larger radius, and the distance of $pb_{o_i o_j}$ from $bs_{c_i c_j}$ widens with the increase of the ratio of the two radii (i.e., $r_i$ and $r_j$). Figure 7 shows an example of this case.

Lemma 4.6. Let $o_i$ and $o_j$ be two objects with non-equi-range uncertain circular regions $(c_i, r_i)$ and $(c_j, r_j)$, respectively, and $x = \frac{c_i + c_j}{2}$ be the midpoint of the line segment $\overline{c_i c_j}$. If $r_i > r_j$, then the probabilistic bisector $pb_{o_i o_j}$ meets $\overline{c_i c_j}$ at point $x'$, where $x'$ lies between $x$ and $c_i$. If the circular range of $o_i$ increases such that $r_i' > r_i$, then the new probabilistic bisector $pb'_{o_i o_j}$ meets $\overline{c_i c_j}$ at point $x''$, where $x''$ lies between $x$ and $c_i$, and $dist(x, x') < dist(x, x'')$.

Proof. Suppose a circle centered at $x$ with radius $\frac{dist(c_i, c_j)}{2}$ divides $o_i$ into $o_{i_1}$ and $o_{i_2}$, where $\frac{o_{i_1}}{A_i} + \frac{o_{i_2}}{A_i} = 1$, and $o_j$ into $o_{j_1}$ and $o_{j_2}$, where $\frac{o_{j_1}}{A_j} + \frac{o_{j_2}}{A_j} = 1$. According to the curvature properties of circles, since $r_i > r_j$, we have $\frac{o_{i_1}}{A_i} < \frac{o_{j_1}}{A_j}$ (in Figure 7, $\frac{o_{1_1}}{A_1} < \frac{o_{2_1}}{A_2}$), which intuitively means that $o_j$ is a more probable NN than $o_i$ to $x$, i.e., $p(o_j, x) > p(o_i, x)$. Thus, $x$ needs to be shifted to a point $x'$ towards $c_i$, such that the probabilities of $o_i$ and $o_j$ being the NN to $x'$ become equal. Let $o_i'$ be an object such that $r_i' > r_i$ and $c_i' = c_i$. Then the circle centered at $x$ with radius $\frac{dist(c_i, c_j)}{2}$ divides $o_i'$ into $o'_{i_1}$ and $o'_{i_2}$, where $\frac{o'_{i_1}}{A_i'} + \frac{o'_{i_2}}{A_i'} = 1$. Now, we have $\frac{o'_{i_1}}{A_i'} < \frac{o_{i_1}}{A_i} < \frac{o_{j_1}}{A_j}$. Thus, $x$ needs to be shifted to a point $x''$ further towards $c_i$, i.e., $dist(x, x') < dist(x, x'')$, such that the probabilities of $o_i'$ and $o_j$ being the NNs become equal at $x''$.




Figure 7: Influence of objects' sizes on the probabilistic bisector 
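The shift described by Lemma 4.6 can be observed numerically. The sketch below is an illustration under stated assumptions, not the paper's procedure: objects are uniform disks, the distance distribution of a disk from a point is obtained from the circle-circle intersection area, and the equal-probability point on the line through the centers is located by bisection. All function names, the center spacing `half_gap`, and the step counts are assumptions.

```python
import math

def lens_area(d, r1, r2):
    """Area of the intersection of two disks with radii r1, r2 and centers d apart."""
    if d >= r1 + r2:
        return 0.0
    if d <= abs(r1 - r2):
        return math.pi * min(r1, r2) ** 2
    # Half-angles subtended by the common chord at each center (clamped for safety).
    a1 = math.acos(max(-1.0, min(1.0, (d * d + r1 * r1 - r2 * r2) / (2 * d * r1))))
    a2 = math.acos(max(-1.0, min(1.0, (d * d + r2 * r2 - r1 * r1) / (2 * d * r2))))
    return r1 * r1 * (a1 - math.sin(2 * a1) / 2) + r2 * r2 * (a2 - math.sin(2 * a2) / 2)

def dist_cdf(gap, radius, r):
    """P(distance from x to a uniform point of the disk <= r); gap = dist(x, center)."""
    return lens_area(gap, r, radius) / (math.pi * radius ** 2)

def prob_i_nearer(gap_i, r_i, gap_j, r_j, steps=1000):
    """P(o_i is nearer to x than o_j), as a Riemann-Stieltjes sum over distances."""
    lo, hi = max(0.0, gap_i - r_i), gap_i + r_i
    h = (hi - lo) / steps
    total, prev = 0.0, dist_cdf(gap_i, r_i, lo)
    for k in range(steps):
        cur = dist_cdf(gap_i, r_i, lo + (k + 1) * h)
        mid = lo + (k + 0.5) * h
        total += (cur - prev) * (1.0 - dist_cdf(gap_j, r_j, mid))
        prev = cur
    return total

def bisector_shift(r_i, r_j, half_gap=10.0, iters=30):
    """Shift towards c_i of the equal-probability point on the line c_i c_j."""
    lo, hi = 0.0, half_gap / 2
    for _ in range(iters):
        mid = (lo + hi) / 2
        # At shift `mid`, x is half_gap - mid from c_i and half_gap + mid from c_j.
        if prob_i_nearer(half_gap - mid, r_i, half_gap + mid, r_j) < 0.5:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

shift_equal = bisector_shift(3.0, 3.0)   # equi-range: no shift expected (Lemma 4.4)
shift_r6 = bisector_shift(6.0, 2.0)      # r_i > r_j
shift_r8 = bisector_shift(8.0, 2.0)      # even larger r_i
print(shift_equal, shift_r6, shift_r8)
```

Under these assumptions the shift is negligible for equal radii, positive when $r_i > r_j$, and larger again when $r_i$ grows, which is the qualitative behaviour stated in Lemmas 4.4 and 4.6.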



The next lemma shows the influence of a third object on the probabilistic bisector of two non-equi-range objects. (Note that the probabilistic bisector of two equi-range objects does not change with the influence of any other object; see Lemma 4.4.) Figure 8 shows an example where object $o_3$ influences the probabilistic bisector of objects $o_1$ and $o_2$. In this figure, the dotted circle centered at $s_1$ with radius $dist(s_1, c_1) + r_1$ encloses one candidate object, $o_1$, but only touches the third object $o_3$. Thus, the probability of $o_3$ being the NN to $s_1$ is zero. However, for any point between $s_1$ and $s_2$, $o_3$ has a non-zero probability of being the NN of that point, and thus $o_3$ influences $pb_{o_1 o_2}$.

Lemma 4.7. Let $o_i$ and $o_j$ be two objects with non-equi-range uncertain circular regions $(c_i, r_i)$ and $(c_j, r_j)$, respectively, where $r_i < r_j$, and $bs_{c_i c_j}$ be the bisector of $c_i$ and $c_j$. An object $o_k$ influences the probabilistic bisector $pb_{o_i o_j}$ for the part of the segment $[s_1, s_2]$ on the line $bs_{c_i c_j}$, where $dist(s, c_i) + r_i > dist(s, c_k) - r_k$ for $s \in bs_{c_i c_j}$.

Proof. Since $r_i < r_j$, we have $maxdist(s, o_i) < maxdist(s, o_j)$. Thus, if the minimum distance $mindist(s, o_k)$ of an object $o_k$ from $s$ is greater than the maximum distance $maxdist(s, o_i)$ of $o_i$ from $s$, i.e., $dist(s, c_k) - r_k > dist(s, c_i) + r_i$, the object $o_k$ cannot be the NN to the point $s$; otherwise $o_k$ has the possibility of being the NN to $s$ and hence $o_k$ influences $pb_{o_i o_j}$.
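The condition of Lemma 4.7 is a simple distance test, so the influenced part of $bs_{c_i c_j}$ can be approximated by sampling points along the bisector. The sketch below assumes 2D points as tuples; the function names, the sampling window `half_len`, and the example coordinates are hypothetical.

```python
import math

def influences(s, c_i, r_i, c_k, r_k):
    """Lemma 4.7 test: o_k may be the NN of s if maxdist(s, o_i) > mindist(s, o_k)."""
    return math.dist(s, c_i) + r_i > math.dist(s, c_k) - r_k

def influenced_part(c_i, r_i, c_j, c_k, r_k, half_len=50.0, n=2001):
    """Sample the bisector of c_i and c_j inside a finite window and return the
    first and last sampled points where o_k can influence, or None."""
    mx, my = (c_i[0] + c_j[0]) / 2, (c_i[1] + c_j[1]) / 2
    dx, dy = c_j[0] - c_i[0], c_j[1] - c_i[1]
    norm = math.hypot(dx, dy)
    ux, uy = -dy / norm, dx / norm  # unit direction of the bisector line
    hits = []
    for k in range(n):
        t = -half_len + 2 * half_len * k / (n - 1)
        s = (mx + t * ux, my + t * uy)
        if influences(s, c_i, r_i, c_k, r_k):
            hits.append(s)
    return (hits[0], hits[-1]) if hits else None

# o_i is the smaller object (r_i < r_j); the bisector of the centers is the y-axis.
c_i, r_i = (-5.0, 0.0), 1.0
c_j = (5.0, 0.0)
c_k, r_k = (0.0, 12.0), 1.0
part = influenced_part(c_i, r_i, c_j, c_k, r_k)
print(part)
```

Only the window of the bisector that is actually sampled is reported, so the end points returned here depend on `half_len`; a real implementation would clip the bisector to the extent of the data space or of the Voronoi edge.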

Note that when the centers of two non-equal objects coincide, the probability of the smaller object dominates the probability of the larger object. Therefore, in those cases, we only consider the object with the smaller radius, and the other object is discarded. Also, if two objects are equal and their centers coincide, no probabilistic bisector exists between them; thus any one of these two objects is considered for computing the PVD.
Algorithms: 

Based on the above lemmas, we propose algorithms to find the probabilistic bisector of any two uncertain 2D objects. We have shown in Lemma 4.4 that the probabilistic bisector of two circular uncertain objects is a straight line when the radii of the two objects are equal. On the other hand, Lemmas 4.5 and 4.6 show that the probabilistic bisector is a curve when the radii of the two objects are non-equal. However, to avoid the computational and maintenance costs, we maintain a bounding box (i.e., a quadrilateral) that encloses the actual probabilistic bisector of two objects. Hence, we name the probabilistic bisector of two circular objects the Probabilistic Bisector Region (PBR). For example, the bounding box that encloses the curve in Figure 6 is the PBR for the two objects $o_1$ and $o_2$. In our algorithm, we first create an ordinary Voronoi diagram by using the centers of all uncertain objects. Then, from each Voronoi edge $e_{ij}$ (i.e., $bs_{c_i c_j}$) of two objects $o_i$ and $o_j$, we compute the PBR that encloses $pb_{o_i o_j}$.

Algorithm 3 computes the probabilistic bisector of two equi-range objects according to Lemma 4.4 (Line 3.2). Otherwise, it calls the function FindProbBisector2D to determine $pb_{o_i o_j}$ for two non-equi-range objects $o_i$ and $o_j$.

Figure 8: Influence of object $o_3$ on the probabilistic bisector of $o_1$ and $o_2$

Algorithm 3: ProbBisector2D(o_i, o_j, e_ij, O)

3.1   if r_i = r_j then
3.2       pb_{o_i o_j} ← ProbBisector(o_i, o_j, e_ij)
3.3   else
3.4       pb_{o_i o_j} ← FindProbBisector2D(o_i, o_j, e_ij, O)
3.5   return pb_{o_i o_j}



The function FindProbBisector2D (see Algorithm 4) takes two non-equi-range objects $o_i$ and $o_j$, the bisector $e_{ij}$ (i.e., a Voronoi edge) of $c_i$ and $c_j$, and the set of objects $O$ as input, and returns $pbr$ for $pb_{o_i o_j}$. The algorithm finds lower ($lval$) and upper ($hval$) bounds representing the required deviations of the probabilistic bisector from the bisector of $c_i$ and $c_j$, such that the PBR can be computed by drawing two lines parallel to $e_{ij}$ at $lval$ and $hval$, respectively. Algorithm 4 first initializes $ipb$ with the intersection point of $e_{ij}$ and $\overline{c_i c_j}$ (Line 4.1). Then, the function InitPBRBound computes initial lower ($lval$) and upper ($hval$) bounds of $pbr$ (Line 4.2). This function first determines a point $x'$ on the line $\overline{c_i c_j}$ where $p(o_i, x') \approx p(o_j, x')$ (we use a search technique similar to the one described for the 1D space). Let $x$ denote $ipb$. If $x'$ is to the left of $e_{ij}$, then $lval$ and $hval$ are set to $x'$ and $x$, respectively. On the other hand, if $x'$ is to the right of $e_{ij}$, then $lval$ and $hval$ are set to $x$ and $x'$, respectively. After that, the function FindInfluencedPart finds a list $IL$ that contains the different segments of the bisector $e_{ij}$ where other objects influence $pb_{o_i o_j}$ (see Lemma 4.7). The function returns $IL$ as an empty list when no other object influences the probabilistic bisector. In that case, the current $lval$ and $hval$ define $pbr$. In Lemma 4.5 we have seen that the maximum distance of $pb_{o_i o_j}$ from the bisector of $c_i$ and $c_j$ is on the line $\overline{c_i c_j}$. Thus, the initially computed $pbr$ encloses the curve of $pb_{o_i o_j}$. On the other hand, if $IL$ is not empty, then for each line segment $ls \in IL$, the function UpdatePBRBound is called to update $lval$ and $hval$ based on the influence of other objects. As $lval$ and $hval$ represent the deviation of $pb_{o_i o_j}$ from $e_{ij}$, we need to compute the deviations for each line segment $ls$, and then take the minimum of all $lval$s and the maximum of all $hval$s to compute the $pbr$. To avoid a brute-force approach of computing $lval$ and $hval$ for every point of an $ls \in IL$, we compute $lval$ and $hval$ for the two extreme points and the mid-point of $ls$. Finally, the algorithm returns $pbr$ for $pb_{o_i o_j}$.

Algorithm 4: FindProbBisector2D(o_i, o_j, e_ij, O)

4.1   ipb ← Intersect(e_ij, c_i c_j)
4.2   InitPBRBound(lval, hval, ipb)
4.3   IL ← FindInfluencedPart(o_i, o_j, e_ij)
4.4   for each ls ∈ IL do
4.5       UpdatePBRBound(lval, hval, ls)
4.6   pbr ← [lval, hval]
4.7   return pbr
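The bound update of Lines 4.4-4.6 can be sketched as follows. This is a simplified illustration: each influenced segment is given as a parameter interval `(a, b)` along $e_{ij}$, and `deviation_at` is a hypothetical callback returning the signed deviation of the probabilistic bisector from $e_{ij}$ at a given parameter (in practice it would be obtained by a search such as the one used for $x'$). Only the two end points and the mid-point of each segment are evaluated, as described above.

```python
def pbr_bounds(lval, hval, influenced_segments, deviation_at):
    """Widen [lval, hval] using three sample points per influenced segment."""
    for a, b in influenced_segments:
        for t in (a, (a + b) / 2, b):
            dev = deviation_at(t)
            lval = min(lval, dev)  # minimum of all lower deviations
            hval = max(hval, dev)  # maximum of all upper deviations
    return lval, hval

# Toy deviation profile (an assumption for the demo only).
def toy_deviation(t):
    return -0.4 + 0.01 * t

initial = (-0.2, 0.0)                       # bounds from InitPBRBound
no_influence = pbr_bounds(*initial, [], toy_deviation)
with_influence = pbr_bounds(*initial, [(0.0, 10.0)], toy_deviation)
print(no_influence, with_influence)
```

When the list of influenced segments is empty, the initial bounds are returned unchanged, which corresponds to the case where no other object influences the probabilistic bisector.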

Algorithm 5 shows the steps of ProbVoronoi2D, which computes the PVD for a given set $O$ of 2D objects. In Line 5.2, the algorithm first creates a Voronoi diagram $VD$ for all centers $c_i$ of objects $o_i \in O$ using [22]. Then, for each Voronoi edge $e_{ij}$ between two objects $o_i$ and $o_j$, the algorithm calls the function ProbBisector2D to compute the probabilistic bisector as a PBR between the two candidate objects, and finally it returns the PVD for the given set $O$ of objects.

Algorithm 5: ProbVoronoi2D(O)

5.1   PVD ← ∅
5.2   VD ← VoronoiDiagramOfCentroids(O)
5.3   for each Voronoi edge e_ij ∈ VD, where o_i, o_j ∈ O do
5.4       PVD ← PVD ∪ ProbBisector2D(o_i, o_j, e_ij, O)
5.5   return PVD



Figure 9: The PVD of three objects $o_1$, $o_2$, and $o_3$

Figure 9 shows the PVD for objects $o_1$, $o_2$, and $o_3$. In this figure, $PVC(o_1)$, $PVC(o_2)$, and $PVC(o_3)$ represent the PVCs for objects $o_1$, $o_2$, and $o_3$, respectively. The boundaries between PVCs, i.e., the PBRs of objects, $pbr_{o_1 o_2}$, $pbr_{o_1 o_3}$, and $pbr_{o_2 o_3}$, are shown using grey bounded regions. For any point inside a PVC, the corresponding object is guaranteed to be the most probable NN. On the other hand, for any point inside a PBR, either of the two objects that share the PBR can be the most probable NN. If more than two PBRs intersect each other in a region, any object associated with these PBRs can be the most probable NN to a query point in that region. Figure 9 shows a dark grey region where $pbr_{o_1 o_2}$, $pbr_{o_1 o_3}$, and $pbr_{o_2 o_3}$ meet.

Complexity: The complexity of Algorithm 5 can be estimated as follows. The complexity of creating a Voronoi diagram (Line 5.2) is $O(n \log n)$ [22], where $n$ is the number of objects. The complexity of finding probabilistic bisectors (Lines 5.3-5.4) is $O(n_e C_{pb})$, where $n_e$ is the number of Voronoi edges and $C_{pb}$ is the expected cost of computing the probabilistic bisector between two circular objects. For real data sets, $n_e$ is expected to be a small integer since an object has only a small number of surrounding objects. The total complexity of the algorithm is $O(n \log n) + O(n_e C_{pb})$. $C_{pb}$ can be estimated as follows. Let $C_b$ be the cost of computing the probability of an object being the NN of a query point, $D$ be the expected distance between the initial probabilistic bisector $ipb$ and the actual probabilistic bisector, and $L$ be the expected number of points on the bisector that need to be considered to find the upper and lower bounds of the probabilistic bisector. Then we have $C_{pb} = O(L C_b \log_2 D)$. This is because the cost of finding a probabilistic bisector is $O(1)$ for the cases when our algorithm can directly compute the probabilistic bisector, and for other cases our algorithm first finds $ipb$ in $O(1)$ and then searches for the actual probabilistic bisector by using Algorithm 4 in $O(L \log D)$. Note that, for both 1D and 2D, the run-time behavior of our algorithm is dominated by those cases for which there is no closed form for a given probabilistic bisector, i.e., the algorithm needs to search for the bisector by using the initial probabilistic bisector.



4.3. Discussion 

PVD for Other Distributions: In this paper, we assume uniform distributions for the pdf of uncertain objects to
illustrate the concept of the PVD. However, the pdf that describes the distribution of an object inside its uncertainty
region can follow an arbitrary distribution, e.g., Gaussian. The concept of the PVD can be extended to any arbitrary
distribution. For example, for an object with a Gaussian pdf and a circular uncertainty region, the probability of the object
being around the center of the circular region is higher than that of being near the boundary of the circle. For such
distributions, a straightforward approach to compute the probabilistic bisector between any pair of objects is as follows.
First, we can use the bisector of the centroids of the two candidate objects as the initial probabilistic bisector. Then we
can refine the initial probabilistic bisector to find the actual probabilistic bisector. Finding suitable initial probabilistic
bisectors for efficient computation of probabilistic bisectors (e.g., lemmas for different cases for 1D and 2D data sets
similar to those for the uniform pdf) for an arbitrary distribution is left for future investigation.

PVD for Higher Dimensions: We can compute the PVD for higher dimensional spaces, similar to 1D and 2D spaces.
For example, in a 3D space, an uncertain object can be represented as a sphere instead of a circle as in 2D. Then, the
probabilistic bisector of two equal size spheres is the plane that perpendicularly bisects the segment joining the centers
of the two spheres. Using this as a base, similar to 2D, we can compute the PVD for 3D objects. We omit a detailed
discussion on PVDs in spaces of more than 2 dimensions.
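
As a minimal sketch of the 3D base case mentioned above, the following Python snippet computes the bisecting plane of two equal size spheres and reports on which side of it a query point lies. The function names and the tuple representation of points are illustrative assumptions, not part of the paper; unequal radii and non-uniform pdfs are not handled.

```python
from typing import Sequence, Tuple

Vector = Tuple[float, ...]

def bisecting_plane(c1: Sequence[float], c2: Sequence[float]) -> Tuple[Vector, float]:
    """Return (n, d) describing the plane {x : n . x = d}.

    The plane passes through the midpoint of c1 and c2 and is
    perpendicular to the segment joining them.
    """
    n = tuple(b - a for a, b in zip(c1, c2))  # normal pointing from c1 to c2
    if all(v == 0 for v in n):
        raise ValueError("centers must be distinct")
    mid = tuple((a + b) / 2.0 for a, b in zip(c1, c2))
    d = sum(ni * mi for ni, mi in zip(n, mid))
    return n, d

def side_of_bisector(q: Sequence[float], c1: Sequence[float], c2: Sequence[float]) -> int:
    """Return 1 if q lies on the side of c1, 2 if on the side of c2, 0 if on the plane."""
    n, d = bisecting_plane(c1, c2)
    s = sum(ni * qi for ni, qi in zip(n, q)) - d
    if s < 0:
        return 1
    if s > 0:
        return 2
    return 0
```

The same code works unchanged in 2D, where the plane degenerates to the bisector line of two equal size circles.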

Higher order PVDs: In this paper, we focus on the first order PVD. By using this PVD, we can find the NN for a given
query point. Thus, the PVD can be used for continuously reporting the 1-NN for a moving query point. To generalize
the concept to kNN queries, we need to develop the k-order PVD. The basic idea would be to find the probabilistic
bisectors among size-k subsets of objects. The detailed investigation of higher order PVDs is a topic of future study.

Handling Updates: To handle updates on the data objects, as with traditional Voronoi diagrams, a straightforward
approach is to recompute the entire PVD. There are algorithms [23, 24] to incrementally update a traditional Voronoi
diagram. Similar ideas can be applied to the PVD to derive incremental update algorithms. We defer such
incremental update algorithms to future work.

Note that, to avoid an expensive computation of the PVD for the whole data set and to cope with updates of
the data objects, we propose an alternative approach based on the concept of a local PVD (see Section 5.2). In this
approach, only a subset of objects that fall within a specified range of the current position of the query is retrieved
from the server, and the local PVD is then created for these retrieved objects to answer PMNN queries. If there is
any update inside the specified range, the process needs to be repeated. Since this approach works only with the
surrounding objects of a query, updates from objects that are outside the range do not affect the performance of the
system.

5. Processing PMNN Queries 

In this section, based on the concept of the PVD, we propose two techniques for answering PMNN queries: a
pre-computation approach and an incremental approach. In the pre-computation approach, we first create the PVD for the
whole data set and then index the PVCs for answering PMNN queries. We name the pre-computation based technique
for processing PMNN queries P-PVD-PMNN. In the incremental approach, on the other hand, we retrieve a set of
surrounding objects with respect to the current query location, create the local PVD for this retrieved data set,
and finally use this local PVD to answer PMNN queries. We name this approach I-PVD-PMNN in this paper.

5.1. Pre-computation Approach 

In the pre-computation approach, we first create the PVD for all objects in the database. After computing the PVD, 
we only need to determine the current Probabilistic Voronoi Cell (PVC), where the current query point is located. The 
query evaluation algorithm can be summarized as follows. 

Initially, the query issuer requests the most probable NN for the current query position q. After receiving the
PMNN request for q, the server algorithm finds the current PVC into which the query point falls using a function
IdentifyPVC and updates cpvc with the current PVC. The algorithm reports the corresponding object p as the most
probable NN, together with the cell cpvc, to the query issuer. The next time q is updated at the query issuer, if q falls inside
cpvc, no request is made to the server, as the most probable NN has not changed. Otherwise, the query issuer
again sends the PMNN request to the server to determine the new PVC and the answer for the updated query position.
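
The request/response loop described above can be sketched for the 1D case as follows. This is a minimal illustration, assuming that the PVCs are available as sorted, non-overlapping half-open ranges; the class names `Server` and `Client`, the method `identify_pvc` (standing in for IdentifyPVC), and the request counter are hypothetical and not taken from the paper.

```python
from bisect import bisect_right

class Server:
    """Holds a pre-computed 1D PVD as non-overlapping ranges (lo, hi, object)."""

    def __init__(self, cells):
        self.cells = sorted(cells)
        self.starts = [lo for lo, _, _ in self.cells]

    def identify_pvc(self, q):
        """Return ((lo, hi), object) for the PVC [lo, hi) that contains q."""
        i = bisect_right(self.starts, q) - 1
        if i < 0 or not (self.cells[i][0] <= q < self.cells[i][1]):
            raise ValueError("query point outside the indexed space")
        lo, hi, obj = self.cells[i]
        return (lo, hi), obj

class Client:
    """Query issuer that contacts the server only when q leaves the cached PVC."""

    def __init__(self, server):
        self.server = server
        self.cpvc = None      # cached current PVC
        self.answer = None    # cached most probable NN
        self.requests = 0     # number of PMNN requests sent to the server

    def update(self, q):
        inside = self.cpvc is not None and self.cpvc[0] <= q < self.cpvc[1]
        if not inside:
            self.cpvc, self.answer = self.server.identify_pvc(q)
            self.requests += 1
        return self.answer
```

For a path that visits three cells, the client in this sketch issues three requests regardless of how many positions are sampled inside each cell.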

As the PVD in a 1D space contains a set of non-overlapping ranges representing the PVCs of objects, the algorithm
returns a single object as the most probable NN for any query point. In a 2D space, on the other hand, the boundary
between two PVCs is a region (i.e., a PBR) rather than a line. When a query point falls inside a PBR, the algorithm can
return both objects that share the PBR as the most probable NNs or, preferably, decide the most probable
NN by computing a top-1-PNN query. (Since a PBR is a small region compared to the PVCs in a realistic setting,
our approach incurs much less computational overhead than the sampling based approach for processing a
PMNN query.)

Figure 10: (a) The PVD, and (b) the MBRs of PVCs for objects $o_1$-$o_3$



Figure 10(a) shows that when the query point is at $q'$, P-PVD-PMNN returns $o_3$ as the most probable NN, as $q'$ falls
into $PVC(o_3)$. When the query point moves to $q''$, the algorithm returns $o_2$ as the answer.

A naive approach to identifying the desired PVC (i.e., the IdentifyPVC function) requires an exhaustive search of
all the PVCs in a PVD, which is an expensive operation. Indexing Voronoi diagrams [25, 26, 27] is a well-known
approach for efficient nearest neighbor search in high-dimensional spaces. Thus, for efficient search of the PVCs,
we index the PVCs of the PVD using an R*-tree [28], a variant of the R-tree [29, 27]. In a 1D space, each PVC is
represented as a 1D range and is indexed using a 1D R*-tree. Since there is no overlap among PVCs, a query point
always falls inside a single PVC, whose corresponding object is the most probable NN of the query point. In a 2D
space, on the other hand, each PVC is enclosed by a Minimum Bounding Rectangle (MBR) and is indexed
using a 2D R*-tree. The MBRs representing PVCs can overlap each other. When a query point falls inside only a
single MBR, the corresponding object is confirmed to be the most probable NN of the query point. However, when a
query point falls inside the overlapping region of two or more MBRs, the actual most probable NN is identified
by checking the PVCs of all candidate MBRs. Figure 10(b) shows the MBRs $[B_1, B_2, B_3, B_4]$, $[B_5, B_6, B_7, B_8]$, and
$[B_9, B_{10}, B_{11}, B_{12}]$ for the PVCs of objects $o_1$, $o_2$, and $o_3$, respectively. In this example, the query point $q'$ intersects
both $[B_5, B_6, B_7, B_8]$ and $[B_9, B_{10}, B_{11}, B_{12}]$, and the actual most probable NN $o_3$ can be determined by checking the
PVCs of $o_3$ and $o_2$; on the other hand, the query point $q'''$ only intersects a single MBR $[B_5, B_6, B_7, B_8]$, so the
corresponding object $o_2$ is the most probable NN of $q'''$.
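
The two-step lookup described above (filter by MBR, then check the exact PVCs only when several MBRs match) can be sketched as follows. This is an illustrative sketch under stated assumptions: a linear scan stands in for the R*-tree, each PVC is represented by a hypothetical exact membership predicate `contains`, and the names `Cell` and `identify_pvc` are not taken from the paper.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Point = Tuple[float, float]

@dataclass
class Cell:
    obj: str                                  # object owning the PVC
    mbr: Tuple[float, float, float, float]    # (xmin, ymin, xmax, ymax)
    contains: Callable[[Point], bool]         # exact PVC membership test

def in_mbr(mbr, q: Point) -> bool:
    xmin, ymin, xmax, ymax = mbr
    return xmin <= q[0] <= xmax and ymin <= q[1] <= ymax

def identify_pvc(cells: List[Cell], q: Point) -> Optional[str]:
    """Return the object whose PVC contains q, or None if no PVC does."""
    # Filter step: a linear scan here; an R*-tree point query in practice.
    candidates = [c for c in cells if in_mbr(c.mbr, q)]
    if len(candidates) == 1:
        # Single MBR: the text treats the object as confirmed.
        return candidates[0].obj
    # Overlapping MBRs: check the exact PVC of every candidate.
    for c in candidates:
        if c.contains(q):
            return c.obj
    return None  # q is in a PBR or outside the indexed space
```

When `identify_pvc` returns `None` for a point covered by several MBRs, the point lies in a PBR, and the answer would be decided by a top-1-PNN computation as discussed earlier.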

Since the above approach only retrieves the current PVC of a moving query point, it needs to access the PVD using
the R*-tree as soon as the query leaves the current PVC. This may incur more I/O than necessary.
To further reduce I/O and improve the processing time, we use a buffer management technique: instead of
retrieving only the PVC that contains the given query point, we retrieve all PVCs whose MBRs intersect with a given
range, called a buffer window, around the query point. These PVCs are buffered and are used to answer subsequent
query points of a moving query. This process is repeated for a PMNN query when the buffered cells cannot answer
the query.
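
A minimal sketch of the buffer management idea follows. To keep it short, it assumes that every PVC coincides with its MBR (axis-aligned rectangular cells), that a list scan stands in for the R*-tree, and that the buffer is refilled whenever no buffered cell contains the query point; the names `BufferedClient`, `buffer_window`, and `fetches` are hypothetical.

```python
from typing import List, Optional, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)
Cell = Tuple[str, Rect]                   # (object id, MBR of its PVC)

def intersects(a: Rect, b: Rect) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def contains(r: Rect, q) -> bool:
    return r[0] <= q[0] < r[2] and r[1] <= q[1] < r[3]

def buffer_window(q, w: float) -> Rect:
    """Square window of side w centered at q (a single point when w is 0)."""
    h = w / 2.0
    return (q[0] - h, q[1] - h, q[0] + h, q[1] + h)

class BufferedClient:
    def __init__(self, server_cells: List[Cell], w: float):
        self.server_cells = server_cells  # stands in for the server-side index
        self.w = w
        self.buffer: List[Cell] = []
        self.fetches = 0                  # number of accesses to the server

    def _lookup(self, q) -> Optional[str]:
        for obj, rect in self.buffer:
            if contains(rect, q):
                return obj
        return None

    def answer(self, q) -> Optional[str]:
        obj = self._lookup(q)
        if obj is None:  # buffered cells cannot answer: refill the buffer
            win = buffer_window(q, self.w)
            self.buffer = [c for c in self.server_cells if intersects(c[1], win)]
            self.fetches += 1
            obj = self._lookup(q)
        return obj
```

On a grid of 10 x 10 cells, a straight path crossing four cells needs four fetches with a window of 0 but only two with a window of 30 in this sketch, at the price of scanning a larger buffer per query point.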

Since the creation of the entire PVD is computationally expensive, the pre-computation based approach is justified 
when the PVD can be re-used which is the case for static data, or when the query spans over the whole data space. 
To avoid expensive pre-computation, next, we propose an incremental approach which is preferable when the query 
is confined to a small region of the data space or when there are frequent updates in the database. 



5.2. Incremental Approach 

In this section, we describe our incremental evaluation technique for processing a PMNN query based on the 
concept of known region and the local PVD. Next, we briefly discuss the concept of known region, and then present 
the detailed algorithm of our incremental approach. 

Known Region: Intuitively, the known region is an explored data space where the positions of all objects are known.
We define the known region as a circular region that bounds the top-$k$ probable NNs with respect to the current query
point (i.e., the center point of the region). For a given point $q_s$, the server expands the search space to incrementally
access objects in the order of their mindist from $q_s$ until it finds the top-$k$ probable nearest neighbors with respect to $q_s$
(we use an existing algorithm [8] to find the top-$k$ NNs). Then the known region is determined by a circular region centered at
$q_s$ that encloses all these $k$ objects. Figure 11 shows the known circular region $K(q_s, r)$ using a dashed circle, where
$k = 3$. The radius $r$ of this known area is determined by $\max(\mathit{maxdist}(q_s, o_1), \mathit{maxdist}(q_s, o_2), \mathit{maxdist}(q_s, o_3))$.
In this example, the top-3 most probable nearest neighbors are $o_1$, $o_2$, and $o_3$.
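
As a small illustration of how the radius of the known region follows from the retrieved objects, the sketch below assumes circular objects given as (cx, cy, radius) tuples, so that the maximum distance from a point to an object is the distance to its center plus its radius. The function names are illustrative, and the top-$k$ search itself is not shown.

```python
import math
from typing import Iterable, Tuple

Circle = Tuple[float, float, float]  # (cx, cy, radius) of an uncertain object

def maxdist(q: Tuple[float, float], obj: Circle) -> float:
    """Largest possible distance from q to any location inside the circle."""
    cx, cy, radius = obj
    return math.dist(q, (cx, cy)) + radius

def known_region_radius(q_s: Tuple[float, float], top_k: Iterable[Circle]) -> float:
    """Radius r of the circle K(q_s, r) that encloses all retrieved objects."""
    return max(maxdist(q_s, o) for o in top_k)
```

For example, with $q_s = (0, 0)$ and objects (3, 4, 1), (0, 2, 0.5), and (-6, 0, 2), the individual maximum distances are 6, 2.5, and 8, so the known region has radius 8.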



Figure 11: Known region $K(q_s, r)$ and objects $o_1$, $o_2$, and $o_3$



The key idea of the incremental approach is to consider only a subset of objects surrounding the moving query point
while evaluating a PMNN query. For example, in a client-server paradigm, the client first requests the server for
objects and the known region by providing the starting point of the moving query path as a query point. Then the
client locally creates a PVD based on the retrieved objects and uses the local PVD for answering the PMNN query.
This process needs to be repeated as soon as the user's request for the PMNN query cannot be satisfied by the data already
retrieved at the client. Though this incremental approach applies to both centralized and client-server paradigms,
without loss of generality, we next explain how to incrementally evaluate a PMNN query in the client-server paradigm.

Algorithm: After retrieving a set of objects from the server, the client locally computes a PVD for those objects.
Then, the client can use the local PVD to determine the most probable nearest neighbor among the objects inside the
known region. However, since the client does not have any knowledge about objects that are outside of the known
region, the most probable nearest neighbor based on the local PVD formed for objects inside the known region is
not guaranteed to be the most probable nearest neighbor with respect to all objects in the database. This is because a PVC
of the local PVD determines the region where the corresponding object is the most probable NN with respect to
objects inside the known region only. Certain locations of the PVC can have other non-retrieved objects, which
are outside the known region, as the most probable NN. Thus, we need to determine a region in the PVC for which
the query result is guaranteed. That is, all locations inside this guaranteed region will have the corresponding object
as the most probable NN. To define the guaranteed region for an object, we use two conditions.

Let $q$ be a query point and $o_i$ be an object inside the known region. If the query point $q$ is inside the PVC
of object $o_i$ and the condition in Equation 2 holds, then it is ensured that $o_i$ is the most
probable NN among all objects in the database.

$$\mathit{maxdist}(q, o_i) < r - \mathit{dist}(q, q_s) \qquad (2)$$

The condition in Equation 2 ensures that no object outside the known region can be the nearest neighbor of the
given query point. This is because, when a circle centered at $q$ completely contains an object, all objects outside this
circle have zero probability of being the NN of $q$.

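To formally define a region based on the above inequality, we re-arrange Equation 2 as follows, where $c_i$ and $r_i$
denote the center and the radius of $o_i$.

$$\mathit{dist}(q, c_i) + r_i < r - \mathit{dist}(q, q_s)$$
$$\Rightarrow \mathit{dist}(q, c_i) + \mathit{dist}(q, q_s) < r - r_i$$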

We can see that the boundary of the above inequality forms an ellipse in a 2D Euclidean space, where
the two foci of the ellipse are $q_s$ and $c_i$, i.e., the sum of the distances from $q_s$ and $c_i$ to any point on the ellipse is
$r - r_i$. Figure 11 shows an example, where the elliptical region for object $o_2$ is shown using a dashed border. Figure 11
shows that when the query point is at $q_1$, the object $o_2$ is confirmed to be the most probable nearest neighbor, as
$\mathit{dist}(q_1, c_2) + r_2 < r - \mathit{dist}(q_1, q_s)$.
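
The first condition and its elliptical form can be checked numerically with a few lines of Python. This is only a sketch under the assumption that objects are circles given as (cx, cy, radius) tuples; the function names are illustrative. The two functions express the same inequality, once as Equation 2 and once as membership in the ellipse with foci $q_s$ and $c_i$.

```python
import math
from typing import Tuple

Point = Tuple[float, float]
Circle = Tuple[float, float, float]  # (cx, cy, radius)

def satisfies_eq2(q: Point, obj: Circle, q_s: Point, r: float) -> bool:
    """Equation 2: maxdist(q, o_i) < r - dist(q, q_s)."""
    cx, cy, r_i = obj
    return math.dist(q, (cx, cy)) + r_i < r - math.dist(q, q_s)

def inside_ellipse(q: Point, obj: Circle, q_s: Point, r: float) -> bool:
    """Re-arranged form: dist(q, c_i) + dist(q, q_s) < r - r_i."""
    cx, cy, r_i = obj
    return math.dist(q, (cx, cy)) + math.dist(q, q_s) < r - r_i
```

For instance, with $q_s = (0, 0)$, $r = 10$, and an object centered at (4, 0) with radius 1, the ellipse consists of the points whose distances to (0, 0) and (4, 0) sum to less than 9: the point (6, 0) lies inside (sum 8), whereas (7, 0) does not (sum 10).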

From the above discussion, we see that for an object $o_i$, the intersection of the PVC and the elliptical region
of $o_i$ forms a region in which every point has $o_i$ as the most probable NN. Figure 12 shows the PVD and
elliptical regions for objects $o_1$, $o_2$, and $o_3$, and a moving query path from $q'$ to $q'''$. In this figure, since $q'$ is inside
the intersection of $PVC(o_3)$ and the elliptical region of $o_3$, $o_3$ is guaranteed to be the most probable NN of $q'$
with respect to all objects in the database. Similarly, $o_2$ is the most probable NN when the query point moves to $q'''$.

If a query point is outside the intersection region of a PVC and the corresponding elliptical region, but falls inside
the PVC, there is still a possibility that the object associated with this PVC is the most probable NN of the query
point. For example, in Figure 11, when the query point is at $q_2$, the condition in Equation 2 fails. For this case,
our algorithm relies on the lower bound of the probability of the object $o_2$ being the nearest neighbor of the
query point $q_2$. We define the second condition based on the lower bound probability of an object.




Figure 12: The incremental approach 



We can compute the lower bound of the probability, $lp(o_i, q)$, of object $o_i$ being the NN of the query point $q$
by using a pessimistic assumption. To compute the lower bound probability, we assume that a non-retrieved virtual
point object is located at the minimum distance from the query point, just outside the boundary of the
known region. For example, in Figure 11, when the client is at $q_2$, we assume that a point object exists just outside the
known region at the location closest to $q_2$ (marked in the figure). Then, we
estimate the probability of the object $o_2$ being the NN of $q_2$, which gives us the lower bound of the probability.

By using the lower bound, the client can determine whether there is a possibility of other non-retrieved objects
being the most probable NN of the current query location. If the probability of the virtual point object $o_v$, $p(o_v, q)$, is
less than the lower bound probability of the candidate object $o_i$, $lp(o_i, q)$, then it is ensured that no other object
in the database has a higher probability of being the NN of $q$ than $o_i$; otherwise, there may exist other objects
in the database with higher probabilities of being the NN of $q$ than $o_i$. Thus, our second condition for the guaranteed
region can be defined as follows:

$$lp(o_i, q) > p(o_v, q) \qquad (3)$$

Based on the above observations, we define a probabilistic safe region for an object $o_i$ as a region where $o_i$ is
guaranteed to be the most probable NN for every point inside that region. Thus, Equation 2 and Equation 3 together
define the guaranteed region for an object $o_i$.
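
To make the second condition concrete, the following sketch estimates $lp(o_i, q)$ and $p(o_v, q)$ by Monte Carlo sampling. Sampling is swapped in here purely for illustration and is an assumption of this sketch, not the evaluation method of the paper. The sketch further assumes circular objects with uniform pdfs given as (cx, cy, radius) tuples and places the virtual point object at distance $r - \mathit{dist}(q, q_s)$ from $q$, i.e., at the nearest point of the known region boundary. The name `condition3` and its parameters are hypothetical.

```python
import math
import random
from typing import List, Tuple

Point = Tuple[float, float]
Circle = Tuple[float, float, float]  # (cx, cy, radius), uniform pdf assumed

def sample_in_disk(obj: Circle, rng: random.Random) -> Point:
    """Draw a location uniformly at random from the circular uncertainty region."""
    cx, cy, radius = obj
    rho = radius * math.sqrt(rng.random())
    theta = 2.0 * math.pi * rng.random()
    return (cx + rho * math.cos(theta), cy + rho * math.sin(theta))

def condition3(q: Point, q_s: Point, r: float, objects: List[Circle], i: int,
               trials: int = 20000, seed: int = 0) -> Tuple[float, float, bool]:
    """Estimate (lp(o_i, q), p(o_v, q)) and report whether lp > p holds."""
    d_v = r - math.dist(q, q_s)  # distance from q to the virtual point object
    if d_v <= 0:
        raise ValueError("q must lie inside the known region")
    rng = random.Random(seed)
    wins_i = wins_v = 0
    for _ in range(trials):
        dists = [math.dist(q, sample_in_disk(o, rng)) for o in objects]
        j = min(range(len(objects)), key=dists.__getitem__)
        if d_v < dists[j]:
            wins_v += 1          # the virtual object is the NN in this sample
        elif j == i:
            wins_i += 1          # the candidate object is the NN in this sample
    lp, p_v = wins_i / trials, wins_v / trials
    return lp, p_v, lp > p_v
```

With a known region of radius 10 around (0, 0), retrieved objects (6, 0, 2) and (-5, 0, 2), and the query at (7.5, 0), Equation 2 fails for the first object (3.5 is not less than 2.5), but the estimated lower bound still exceeds the probability of the virtual object, so the answer would be accepted by the second condition in this sketch.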

We use the above two conditions and the local PVD to incrementally evaluate a PMNN query. The algorithm first
retrieves a set of surrounding objects for the given query point $q$, and creates a PVD, named IPVD, for those objects.
Then, the algorithm finds the PVC and the corresponding object $o_i$ as the most probable nearest neighbor of the query
point $q$ with respect to the objects within the known region. If $q$ is inside a PVC, the object $o_i$ is returned as
the most probable nearest neighbor if $q$ satisfies Equation 2 or Equation 3. If neither condition holds, the
algorithm requests a new set of objects with respect to the current query point $q$ and repeats the above process on
the newly retrieved set of objects.

Discussion: Our pre-computation based approach computes the PVD for all objects in the database and then indexes
the PVD using an R-tree to efficiently process PMNN queries. Since the pre-computation of the PVD for the entire
data set is computationally expensive, the pre-computation based approach is justified when the PVD can be re-used
for a large number of queries, as the cost is amortized among queries (e.g., [11, 20]). Thus, the pre-computation based
approach is appropriate for the following settings: the data set largely remains static, there is a large number of queries
in the system, and the query spans the whole data space.

On the other hand, in our incremental approach, we retrieve a set of surrounding objects for the current query
location, and then incrementally process PMNN queries based only on this retrieved set of objects. Only data close
to the given query are accessed for query evaluation. As the evaluation of this approach depends on the location of
the query, this approach is also called the query dependent approach, as opposed to the data dependent approach (e.g.,
the pre-computation based approach), where the locations of queries are not taken into account. The incremental approach
is preferred when there are updates in the database or the query is confined to a small region of the data
space. A comparative discussion of the pre-computation approach and the incremental approach for point data
sets can be found in [30, 20].

6. Experimental Study 

We compare our PVD based approaches for the PMNN query (P-PVD-PMNN and I-PVD-PMNN) with a sampling 
based approach (Naive-PMNN), which processes a PMNN query as a sequence of static PNN queries at sampled 
locations. Though in Naive-PMNN we use the most recent technique of static top-1-PNN queries [8], any existing
technique for static PNN queries 015] can be used. Note that, by using the existing method in (§1, for each uncertain 
object $o_i$, we could only define a region (or UV-cell) where $o_i$ has a non-zero probability of being the NN for any point
in this region. Thus, this method cannot be used to determine whether an object has the highest probability of being 
the NN to a query point. Therefore, we compare our approach with a sampling based approach. 

In our experiments, we measure the query processing time, the I/O costs, and the communication costs as the
number of communications between a client and a server. Note that while the processing and I/O costs are
performance metrics for both centralized and client-server paradigms, the communication cost only applies to
the client-server paradigm. In this paper, we run the experiments in the centralized paradigm, where the query issuer
and the processor reside on the same machine. Thus, we measure the communication cost as the number of times the
query issuer communicates with the query processor while executing a PMNN query.

6.1. Experimental Setup 

We present experimental results for both 1D and 2D data sets.

For 2D data, we have used both synthetic and real data sets. We normalize the data space into a span of 10,000 x
10,000 square units. We generated synthetic data sets with uniform (U) and Zipfian (Z) distributions, representing a
wide range of real scenarios. For both uniform and Zipfian, we vary the data set size from 5K to 25K. To introduce
uncertainty in data objects, we randomly choose the uncertainty range of an object between 5 x 5 and 30 x 30 square
units, and approximate the selected range using a circle. For real data distributions, we use the data set from Los
Angeles (L) with 12K geographical objects described by ranges of longitudes and latitudes [31]. Note that, in both
uniform and Zipfian distributions, objects can overlap each other. More importantly, in the Zipfian distribution, most of
the objects are concentrated within a small region of the space, so objects largely overlap with each other. Also,
our real data set includes objects with large and overlapping regions. Thus, we do not present any separate experimental
results for overlapping objects.

For 1D data, we have only used synthetic data sets. In this case, we generated synthetic data sets with uniform (U)
and Zipfian (Z) distributions in a data space of 10,000 units. The uncertainty range of an object is chosen as a
random value between 5 and 30 units. We also vary the data set size from 100 to 500. These values are comparable to
2D data set sizes and scenarios.

For query paths, we have generated two different types of query trajectories, random (R) and directional (D), 
representing the query movement paths covering a large number of real scenarios. The default length of a trajectory 
is a fixed length of 1000 steps, and consecutive points are connected with a straight line of a length of 5 units. For 
each type of query path, we run the experiments for 20 different trajectories starting at random positions in the data 
space, and determine the average results. We present the processing time, I/O cost, and the communication cost for 
executing a complete trajectory (i.e., a PMNN query). In our experiments, since the trajectory of a moving query path 
is unknown, we use the generated trajectories as input, but do not provide these to the server in advance. 

We run the experiments on a desktop computer with Intel(R) Core(TM) 2 CPU 6600 at 2.40 GHz and 2 GB RAM.

6.2. Performance Evaluation

In this section, we evaluate our proposed techniques: the pre-computation approach (P-PVD-PMNN) and the incremental
approach (I-PVD-PMNN) in Sections 6.2.1 and 6.2.2, respectively.

It is well known that a pre-computation based approach is suitable for settings where the PVD can be re-used (e.g.,
static data sets) for a large number of queries or the query spans the whole data space, whereas the incremental
or local approach is suitable for settings where the query is confined to a small space and there are frequent
updates in the database (e.g., [30, 11, 20]). Since the two approaches target different settings and
their parameters differ from each other, we evaluate them independently and compare each of them
with the sampling based approach.

6.2.1. Pre-computation Approach

In the pre-computation approach, we first create the PVD for the entire data set and use an R*-tree to index the
MBRs of PVCs. For Naive-PMNN, on the other hand, we use an R*-tree to index uncertain objects. In both cases, we
use a page size of 1KB and a node capacity of 50 entries for the R*-tree.

Experiments with 2D Data Sets:

We vary the following parameters in our experiments: the length of a query trajectory, the data set size, and the 
size of the buffer window that determines the number of PVCs retrieved each time with respect to a query point. 

Effect of the Length of a Query Trajectory: In this set of experiments, we vary the length of moving queries from
1000 to 5000 units of the data space. We run the experiments for data sets U(10K), Z(10K), and L(12K). Since the real
data set size is 12K, the data set sizes for U and Z are both set to 10K. Figure 13 shows the processing time, I/O costs,
and the number of communications required for a PMNN query for different query trajectory lengths. Figures 13(a)-(c)
present the results for the U data set, where we can see that, for both P-PVD-PMNN and Naive-PMNN, the processing
time, I/O costs, and the number of communications increase with the length of the query trajectory,
which is expected. The figures also show that our P-PVD-PMNN approach outperforms Naive-PMNN by at least an
order of magnitude in all metrics. This is because P-PVD-PMNN only needs to identify the current PVC rather than
compute the top-1-PNN for every sampled location of the moving query.

The results for both the Z (see Figures 13(d)-(f)) and L (see Figures 13(g)-(i)) data sets show trends similar to those
of the U data set described above.

Effect of Data Set Size: In this set of experiments, we vary the data set size from 5K to 25K and compare the
performance of our P-PVD-PMNN with Naive-PMNN for both U (see Figures 14(a)-(c)) and Z (see Figures 14(d)-(f))
distributions. In these experiments, we set the trajectory length to 5000 units.

Figures 14(a)-(f) show that, in general for P-PVD-PMNN, the processing time, I/O costs, and the number of
communications increase with the data set size. The reason is as follows. For a larger data set, since the
density of objects is higher, we have smaller PVCs. Thus, for a larger data set, as the query point moves, it crosses the
boundaries of PVCs more frequently than for a smaller data set. This incurs extra computational overhead
for a larger data set. On the other hand, for Naive-PMNN, the processing time, I/O costs, and communication costs
remain almost constant as the data set size increases. This is because, unless the R*-tree gains a new level due to
the increase of the data set size, the processing costs of Naive-PMNN do not vary with the data set size,
which is the case in Figures 14(a)-(f).

The figures also show that our P-PVD-PMNN outperforms Naive-PMNN by an order of magnitude in processing
time and by 2 orders of magnitude in I/Os and number of communications for all data sets. The results also show that
P-PVD-PMNN performs similarly for both directional (D) and random (R) query movement paths.

Effect of Buffer Window: In this set of experiments, we study the impact of introducing a buffer for processing a
PMNN query. We vary the value of the buffer window from 0 to 400 units of the data space, and then run the experiments
for data sets U(10K), Z(10K), and L(12K). We set the trajectory length to 5000 units.

In these experiments, all PVCs whose MBRs intersect with a buffer window centered at $q$ having the length and
width of the buffer window are retrieved from the R*-tree and sent to the client. The client stores these PVCs in its
buffer. When the buffer window is 0, the algorithm only retrieves those PVCs whose MBRs contain the given query point.
On the other hand, when the buffer window is 100, all PVCs whose MBRs intersect with the buffer window centered at $q$
having the length and width of 100 units (i.e., the buffer window covers 100 x 100 square units in the data space) are
retrieved. In this setting, we expect that the I/O costs will be reduced for a larger buffer window, because the
server does not need to access the R*-tree as long as the buffered PVCs can serve the subsequent query points of a
moving query.

Figure 13: The effect of the query trajectory length in U (a-c), Z (d-f), and L (g-i)

Figure 14: The effect of the data set size in U (a-c), Z (d-f)

Figure 15: The effect of buffer window in U (a-c), Z (d-f), and L (g-i)

Figures 15(a)-(c) show the processing time, the I/O costs, and the number of communications, respectively, for
varying the size of the buffer window from 0 to 400 units for the U data set. Figure 15(a) shows that for P-PVD-PMNN,
in general, the processing time increases with the buffer window. The reason is that for a very large buffer
window, a large number of PVCs are buffered, and the processing time increases as the algorithm needs to check these
PVCs for a moving query. On the other hand, Figure 15(b) shows that for P-PVD-PMNN, I/O costs decrease as
the buffer window increases. This is because, for a larger buffer window, the algorithm fetches more
PVCs at a time from the server and thereby needs to access the PVD through the R*-tree fewer times. The
figure also shows that P-PVD-PMNN outperforms Naive-PMNN by an order of magnitude in processing time and by 2
orders of magnitude in I/O. Figure 15(c) shows that the number of communications for P-PVD-PMNN continuously
decreases as the buffer window increases, since the client fetches more PVCs at a time from the server. However, for
Naive-PMNN, the client communicates with the server for each sampled location of the query, and thus the number
of communications remains constant.

The results on the Z (see Figures 15(d)-(f)) and L (see Figures 15(g)-(i)) data sets show trends similar to those of
the U data set described above.

Experiments with 1D Data Sets:

For 1D data, we have run a set of experiments similar to the 2D ones, where we vary the length of the query trajectory,
the data set size, and the size of the buffer window.

Effect of the Length of a Query Trajectory: In this set of experiments, we vary the query trajectory length from 1000
to 5000 units while evaluating a PMNN query for 1D data sets. We run the experiments for both U (see Figures 16(a)-(c))
and Z (see Figures 16(d)-(f)) data sets. The data set size is set to 100. We can see that, for both P-PVD-PMNN
and Naive-PMNN, the processing time, I/O costs, and number of communications increase with the
query trajectory length for 1D data sets, which is expected. The figures also show that our P-PVD-PMNN outperforms
Naive-PMNN by at least an order of magnitude in terms of processing time, I/Os, and communication costs.

Figure 16: The effect of the query trajectory length in U (a-c), Z (d-f) for 1D data

Figure 17: The effect of the data set size in U (a-c), Z (d-f) for 1D data

Effect of Data Set Size: We also run the experiments with varying data set size (see Figures 17(a)-(c) for U and
Figures 17(d)-(f) for Z data sets). In these experiments, the trajectory length is set to 5000 units. The figures show
that, for P-PVD-PMNN, the processing time, I/O costs, and number of communications increase with the data set size
for 1D data sets. This is because a larger data set yields smaller PVCs, so a moving query needs to check a higher
number of PBRs than it does for a smaller data set. Figures 17(a)-(f) also show that P-PVD-PMNN outperforms
Naive-PMNN by at least an order of magnitude in all evaluation metrics. The results also show that P-PVD-PMNN
performs similarly for both directional (D) and random (R) query movement paths.

Figure 18: The effect of buffer window in U (a-c), Z (d-f) for 1D data
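
The cell-size argument given for the data set size experiment can be illustrated with a minimal sketch. It is a simplification under stated assumptions, not the method evaluated in this paper: it uses exact (certain) 1D points and ordinary Voronoi cells, whose boundaries are the midpoints between neighboring points, in place of uncertain objects, PVCs, and PBRs. The function name `cells_crossed` and the evenly spaced data are hypothetical.

```python
import bisect


def cells_crossed(points, start, end):
    """Count 1D Voronoi cell boundaries crossed when moving from start to end.

    Assumption: points are exact 1D locations, so each cell boundary is the
    midpoint between two neighboring points. This is only an analogy for the
    probabilistic cells discussed in the text.
    """
    pts = sorted(points)
    bounds = [(a + b) / 2 for a, b in zip(pts, pts[1:])]
    lo, hi = min(start, end), max(start, end)
    # Number of boundaries that fall inside the interval [lo, hi].
    return bisect.bisect_right(bounds, hi) - bisect.bisect_left(bounds, lo)


# Hypothetical data: evenly spaced points over a data space of 10000 units.
small = [i * 100 + 50 for i in range(100)]  # 100 objects
large = [i * 20 + 10 for i in range(500)]   # 500 objects

print(cells_crossed(small, 0, 5000))  # 50
print(cells_crossed(large, 0, 5000))  # 250
```

For the same 5000-unit trajectory, the denser data set produces five times as many boundary crossings, which matches the direction of the trend reported for P-PVD-PMNN.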

Effect of Buffer Window: In this set of experiments, we study the impact of introducing a buffer for processing
a PMNN query. We vary the value of the buffer window up to 400 units, and run the experiments for U (see
Figures 18(a)-(c)) and Z (see Figures 18(d)-(f)) data sets. In these experiments, we set the data set size to 100 and
the trajectory length to 5000 units. The experimental results show that P-PVD-PMNN outperforms Naive-PMNN by 1-2
orders of magnitude for I/O and processing costs, and 2-3 orders of magnitude in terms of communication costs.

6.2.2. Incremental Approach 

In the incremental approach, we use an R*-tree to index the MBRs of uncertain objects, for both I-PVD-PMNN
and Naive-PMNN. In both cases, we use a page size of 1KB and a node capacity of 50 entries for the R*-tree.
Experiments with 2D Data Sets: 

We vary the following parameters in our experiments: the value of k (i.e., the number of objects retrieved at each
step), the data set size, and the length of the query trajectory, and compare the performance of I-PVD-PMNN with
Naive-PMNN.

Effect of k: In this set of experiments, we study the impact of k on the performance of processing a
PMNN query. We vary the value of k from 10 to 50, and run the experiments for all available data sets (U, Z,
and L). In these experiments, for both U and Z, we have set the data set size to 10K. Figures 19(a)-(c) show the
processing time, the I/O costs, and the number of communications, respectively, for k varying from 10 to 50 for the U
data set. Figure 19(a) shows that the processing time remains almost constant for varying k. The processing time of
I-PVD-PMNN is on average 6 times lower than that of Naive-PMNN for directional (D) query paths, and on average 13
times lower for random (R) query paths. On the other hand, Figures 19(b)-(c) show that the I/O
costs and the number of communications decrease as k increases. This is because, for a larger value of k, the
client fetches more data at a time from the server, and thereby needs to communicate fewer times with the
server. The figures also show that I-PVD-PMNN outperforms Naive-PMNN by 2-3 orders of magnitude for both
I/O and communication costs.
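
The relationship between k and the number of communications can be illustrated with a toy simulation. This is a sketch under strong assumptions and not the I-PVD-PMNN algorithm: objects are exact 1D points rather than uncertain objects, the client uses the classical known-region test for nearest neighbors instead of a probabilistic safe region, and the names `count_fetches`, `points`, and `path` are hypothetical.

```python
def count_fetches(points, path, k):
    """Count server round trips for a query moving along `path` (1D positions).

    Assumption: on each fetch the server returns the k points nearest to the
    current position. The known region is then the interval of radius `radius`
    around the fetch position `center`; every unfetched point lies at least
    that far from `center`.
    """
    fetches = 0
    center, radius, known = None, 0.0, []
    for p in path:
        if known:
            d = min(abs(p - x) for x in known)  # nearest fetched point to p
            # The locally computed NN is guaranteed correct if the circle
            # around p with radius d lies inside the known region.
            if abs(p - center) + d <= radius:
                continue
        # Otherwise, ask the server for the k nearest points to p.
        known = sorted(points, key=lambda x: abs(x - p))[:k]
        center = p
        radius = abs(known[-1] - p)
        fetches += 1
    return fetches


# Hypothetical setup: 100 evenly spaced objects, query moving 1 unit per step.
points = list(range(0, 1000, 10))
path = list(range(0, 1000))
for k in (2, 10, 20):
    print(k, count_fetches(points, path, k))
```

In this toy setting a larger k enlarges the known region obtained per request, so the client contacts the server fewer times over the same trajectory, which is the same direction as the trend reported above.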

Figures 19(d)-(f) and 19(g)-(i) show the performance behaviors of Z and L data sets, respectively, which are similar
to those of the U data set.

Effect of Data Set Size: In this set of experiments, we vary the data set size from 5K to 25K and compare the
performance of our approach I-PVD-PMNN with Naive-PMNN. We set the trajectory length to 5000 units. Also, in
these experiments, we have set the value of k to 30. Figures 20(a)-(c) and 20(d)-(f) show the processing time, I/O costs,
and the number of communications for U and Z data sets, respectively. The figures also show that I-PVD-PMNN
outperforms Naive-PMNN by 1-3 orders of magnitude for all data sets.

Figure 19: The effect of (k) in U (a-c), Z (d-f), and L (g-i)

Figure 20: The effect of the data set size in U (a-c), Z (d-f)

Figure 21: The effect of the query length in U (a-c), Z (d-f)

Effect of the Length of a Query Trajectory: We vary the length of moving queries from 1000 to 5000 units of the
data space. In these experiments, for both U and Z, we have set the data set size to 10K and the value of k to 30.
Figure 21 shows that the processing time, I/O costs, and the number of communications increase with the length of
the query trajectory for both U and Z data sets, which is expected. The processing time of I-PVD-PMNN is on
average 5 times lower than that of Naive-PMNN for directional (D) query paths and on average 10 times lower for
random (R) query paths. Also, I-PVD-PMNN outperforms Naive-PMNN by at least an order of magnitude for both
I/O and communication costs.

Figure 22: The effect of (k) in U (a-c), Z (d-f) for 1D data

Figure 23: The effect of the data set size in U (a-c), Z (d-f) for 1D data

Experiments with 1D Data Sets: We also evaluate our incremental approach with 1D data sets by varying the following
parameters: the value of k, the data set size, and the length of the query trajectory.

Effect of k: In this set of experiments, we study the impact of k on the performance of I-PVD-PMNN for
1D data sets. Figure 22 shows the results for U and Z data sets, for k varying from 10 to 50. In these experiments,
we have set the data set size to 100. Figure 22(a) shows that the processing time remains almost constant for varying k.
Moreover, the processing time of I-PVD-PMNN is on average 6 times lower than that of Naive-PMNN for directional (D)
query paths, and on average 10 times lower for random (R) query paths. Figures 22(b)-(c)
show that the I/O costs and the number of communications decrease as k increases. The figures also show that
I-PVD-PMNN outperforms Naive-PMNN by 2-3 orders of magnitude in terms of both I/O costs and communication
costs.

Figures 22(d)-(f) show the results for the Z data set, which are similar to those for the U data set.

Effect of Data Set Size: In this set of experiments, we vary the data set size from 100 to 500 and compare the
performance of our approach I-PVD-PMNN with Naive-PMNN. In these experiments, we have set the value of k to
30 and the trajectory length to 5000 units. Figures 23(a)-(c) and 23(d)-(f) show the processing time, I/O costs, and the
number of communications for U and Z data sets, respectively. The results reveal that the processing time, I/O costs,
and communication costs increase with the data set size. The figures also show that I-PVD-PMNN
outperforms Naive-PMNN by at least an order of magnitude for all data sets.

Effect of the Length of a Query Trajectory: We also vary the length of the query trajectory for 1D data sets, and
the results (Figure 24) exhibit behavior similar to that for 2D data sets. In these experiments, we vary the
trajectory length from 1000 to 5000 units of the data space. Also, we have set the data set size to 100 and the value
of k to 30. Figure 24 shows that, for both U and Z data sets, the processing time, I/O costs, and communication
costs increase with the trajectory length. The figure also shows that I-PVD-PMNN outperforms Naive-PMNN
in all evaluation metrics.



7. Summary 

In this paper, we have introduced the concept of Probabilistic Voronoi Diagrams (PVDs). A PVD divides the data 
space using a probability measure. Based on the PVD, we developed two different techniques: a pre-computation 
approach and an incremental approach, for efficient processing of Probabilistic Moving Nearest Neighbor (PMNN) 
queries. Our experimental results show that our techniques outperform the sampling based approach by up to two 
orders of magnitude in our evaluation metrics.

Figure 24: The effect of the query length in U (a-c), Z (d-f) for 1D data



Our work on PVDs opens new avenues for future work. Currently, our approach finds the most probable NN for a
moving query point; in the future, we aim to extend it to the top-k most probable NNs. PVDs for other types of
probability density functions, such as the normal distribution, are to be investigated. We also plan a detailed
investigation of PVDs in higher-dimensional spaces.